Foreword



On behalf of the Organizing Committee, we would like to welcome you to the 
proceedings of the 23rd International Conference on Conceptual Modeling (ER
2004). This conference provided an international forum for technical discussion 
on conceptual modeling of information systems among researchers, developers 
and users. This was the third time that this conference was held in Asia; the 
first time was in Singapore in 1998 and the second time was in Yokohama, 
Japan in 2001. China is the third largest nation with the largest population in 
the world. Shanghai, the largest city in China and a great metropolis, famous in 
Asia and throughout the world, is therefore a most appropriate location to host 
this conference. 

This volume contains papers selected for presentation and includes the two 
keynote talks by Prof. Hector Garcia-Molina and Prof. Gerhard Weikum, and 
an invited talk by Dr. Xiao Ji. 

This volume also contains industrial papers and demo/poster papers. An 
additional volume contains papers from 6 workshops. 

The conference also featured three tutorials: (1) Web Change Management 
and Delta Mining: Opportunities and Solutions, by Sanjay Madria, (2) A Survey 
of Data Quality Issues in Cooperative Information Systems, by Carlo Batini, 
and (3) Visual SQL - An ER-Based Introduction to Database Programming, by
Bernhard Thalheim.

The technical program of the conference was selected by a distinguished 
program committee consisting of three PC Co-chairs, Hongjun Lu, Wesley Chu, 
and Paolo Atzeni, and more than 70 members. They faced a difficult task in 
selecting 57 papers from many very good contributions. This year the number of 
submissions, 293, was a record high for ER conferences. We wish to express our 
thanks to the program committee members, external reviewers, and all authors 
for submitting their papers to this conference. 

We would also like to thank: the Honorary Conference Chairs, Peter P. Chen 
and Ruqian Lu; the Coordinators, Zhongzhi Shi, Yoshifumi Masunaga, Elisa
Bertino, and Carlo Zaniolo; Workshop Co-chairs, Shan Wang and Katsumi Tanaka;
Tutorial Co-chairs, Jianzhong Li and Stefano Spaccapietra; Panel Co-chairs,
Chin-Chen Chang and Erich Neuhold; Industrial Co-chairs, Philip S. Yu, Jian
Pei, and Jiansheng Feng; Demos and Posters Co-chairs, Mong-Li Lee and Gillian
Dobbie; Publicity Chair, Qing Li; Publication Chair cum Local Arrangements 
Chair, Shuigeng Zhou; Treasurer, Xueqing Gong; Registration Chair, Xiaoling 
Wang; Steering Committee Liaison, Arne Solvberg; and Webmasters, Kun Yue, 
Yizhong Wu, Zhimao Guo, and Keping Zhao. 

We wish to extend our thanks to the Natural Science Foundation of China, 
the ER Institute (ER Steering Committee), the K.C. Wong Education Foundation
in Hong Kong, the Database Society of the China Computer Federation,
ACM SIGMOD, ACM SIGMIS, IBM China Co., Ltd., Shanghai Baosight Software
Co., Ltd., and the Digital Policy Management Association of Korea for
their sponsorships and support. 

At this juncture, we wish to remember the late Prof. Yahiko Kambayashi, who
passed away on February 5, 2004, at age 60 and was at that time a workshop co-chair of
the conference. Many of us will remember him as a friend, a mentor, a leader,
an educator, and our source of inspiration. We express our heartfelt condolences
and our deepest sympathy to his family.

We hope that the attendees found the technical program of ER 2004 to be 
interesting and beneficial to their research. We trust they enjoyed this beautiful 
city, including the night scene along the Huangpujiang River and the postcon- 
ference tours to the nearby cities, leaving a beautiful and memorable experience 
for all. 



November 2004 Tok Wang Ling 

Aoying Zhou 




Preface 



The 23rd International Conference on Conceptual Modeling (ER 2004) was held 
in Shanghai, China, November 8-12, 2004. Conceptual modeling is a fundamental 
technique used in analysis and design as a real-world abstraction and as the 
basis for communication between technology experts and their clients and users. 
It has become a fundamental mechanism for understanding and representing 
organizations, including new e-worlds, and the information systems that support
them. 

The International Conference on Conceptual Modeling provides a major fo- 
rum for presenting and discussing current research and applications in which 
conceptual modeling is the major emphasis. Since the first edition in 1979, the 
ER conference has evolved into the most prestigious one in the areas of concep- 
tual modeling research and applications. Its purpose is to identify challenging 
problems facing high-level modeling of future information systems and to shape 
future directions of research by soliciting and reviewing high-quality applied and 
theoretical research findings. ER 2004 encompassed the entire spectrum of con- 
ceptual modeling. It addressed research and practice in areas such as theories 
of concepts and ontologies underlying conceptual modeling, methods and tools 
for developing and communicating conceptual models, and techniques for trans- 
forming conceptual models into effective information system implementations. 

We solicited forward-looking and innovative contributions that identify 
promising areas for future conceptual modeling research as well as traditional 
approaches to analysis and design theory for information systems development. 

The Call for Papers attracted 295 exceptionally strong submissions of re- 
search papers from 36 countries/regions. Due to limited space, we were only able 
to accept 57 papers from 21 countries/regions, for an acceptance rate of 19.3%. 
Inevitably, many good papers had to be rejected. The accepted papers covered 
topics such as ontologies, patterns, workflows, metamodeling and methodology, 
innovative approaches to conceptual modeling, foundations of conceptual mod- 
eling, advanced database applications, systems integration, requirements and 
evolution, queries and languages, Web application modeling and development, 
schemas and ontologies, and data mining. 

We are proud of the quality of this year’s program, from the keynote speeches 
to the research papers, with the workshops, panels, tutorials, and industrial pa- 
pers. We were honored to host the outstanding keynote addresses by Hector 
Garcia-Molina and Gerhard Weikum. We appreciate the hard work of the or- 
ganizing committee, with interactions around the clock with colleagues all over 
the world. Most of all, we are extremely grateful to the program committee 
members of ER 2004 who generously spent their time and energy reviewing sub- 
mitted papers. We also thank the many external referees who helped with the 
review process. Last but not least, we thank the authors who wrote high-quality 
research papers and submitted them to ER 2004, without whom the conference
would not have existed. 



November 2004 



Paolo Atzeni, Wesley Chu, and Hongjun Lu 
Entity Resolution: Overview and Challenges



Hector Garcia-Molina 

Stanford University, Stanford, CA, USA 
hector@cs.stanford.edu



Entity resolution is a problem that arises in many information integration scenarios: We 
have two or more sources containing records on the same set of real-world entities (e.g., 
customers). However, there are no unique identifiers that tell us what records from one 
source correspond to those in the other sources. Furthermore, the records representing 
the same entity may have differing information, e.g., one record may have the address 
misspelled, another record may be missing some fields. An entity resolution algorithm 
attempts to identify the matching records from multiple sources (i.e., those correspond- 
ing to the same real-world entity), and merges the matching records as best it can. Entity 
resolution algorithms typically rely on user-defined functions that (a) compare fields or 
records to determine if they match (are likely to represent the same real world entity), 
and (b) merge matching records into one, and in the process perhaps combine fields 
(e.g., creating a new name based on two slightly different versions of the name). 

In this talk I will give an overview of the Stanford SERF project, which is building a
framework to describe and evaluate entity resolution schemes. In particular, I will give 
an overview of some of the different entity resolution settings: 

- De-duplication versus fidelity enhancement. In the de-duplication problem, we have 
a single set of records, and we try to merge the ones representing the same real 
world entity. In the fidelity enhancement problem, we have two sets of records: a 
base set of records of interest, and a new set of acquired information. The goal is to 
coalesce the new information into the base records. 

- Clustering versus snapping. With snapping, we examine records pair-wise and de- 
cide if they represent the same entity. If they do, we merge the records into one, 
and continue the process of pair-wise comparisons. With clustering, we analyze all 
records and partition them into groups we believe represent the same real world 
entity. At the end, each partition is merged into one record. 

- Confidences. In some entity resolution scenarios we must manage confidences. For 
example, input records may have a confidence value representing how likely it is 
they are true. Snap rules (that tell us when two records match) may also have
confidences representing how likely it is that two records actually represent the 
same real world entity. As we merge records, we must track their confidences. 

- Schema Mismatches. In some entity resolution scenarios we must deal, not just with 
resolving information on entities, but also with resolving discrepancies among the 
schemas of the different sources. For example, the attribute names and formats from 
one source may not match those of other sources. 
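
The snapping setting described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the SERF framework: the record layout (a dictionary from field name to a set of values), the match rule (a shared phone number), and the merge rule (field-wise union) are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set

Record = Dict[str, Set[str]]  # hypothetical layout: field name -> set of observed values


def match(a: Record, b: Record) -> bool:
    """Assumed match rule: two records match if they share a phone number."""
    return bool(a.get("phone", set()) & b.get("phone", set()))


def merge(a: Record, b: Record) -> Record:
    """Assumed merge rule: keep every value seen for every field."""
    return {f: a.get(f, set()) | b.get(f, set()) for f in a.keys() | b.keys()}


def snap(records: List[Record],
         match: Callable[[Record, Record], bool] = match,
         merge: Callable[[Record, Record], Record] = merge) -> List[Record]:
    """Pair-wise snapping: merge matching records and keep comparing the merged result."""
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        partner = next((r for r in resolved if match(current, r)), None)
        if partner is None:
            resolved.append(current)
        else:
            resolved.remove(partner)
            # The merged record may now match records that neither input matched alone.
            pending.append(merge(current, partner))
    return resolved
```

Because a merged record is pushed back onto the work list, a record that shares one phone number with "Joe" and another with "J." can pull all three into a single entity, which is the behavior the continued pair-wise comparison is meant to capture.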

In the talk I will address some of the open problems and challenges that arise in 
entity resolution. These include: 
- Performance. Entity resolution algorithms must perform a very large number of field
and record comparisons (via the user-provided functions), so it is critical to perform
only the absolute minimum number of invocations of the comparison functions.
Developing efficient algorithms is analogous to developing efficient join algorithms 
for relational databases. 

- Confidences. Very little is understood as to how confidences should be manipulated 
in an entity resolution setting. For example, say we have two records, one reporting 
that “Joe” uses cell phone 123, and the other reporting that “Joseph” uses phone
456. The first record has confidence 0.9 and the second one 0.7. A snap rule tells us 
that “Joe” and “Joseph” are the same person with confidence 0.8. Do we assume this 
person has been using two phones? Or that 123 is the correct number because that 
record has a higher confidence? If we do merge the records, what are the resulting 
confidences? 

- Metrics. Say we have two entity resolution schemes, A and B. How do we know 
if A yields “better” results compared to B? Or say we have one base set of
records, and we wish to enhance its fidelity with either new set X or new set Y. 
Since it costs money to acquire either new set, we only wish to use one. Based 
on samples of X and Y, how do we decide which set is more likely to enhance 
our base set? To address questions such as these we need to develop metrics that 
quantify not just the performance of entity resolution, but also its accuracy.

- Privacy. There is a strong connection between entity resolution and information 
privacy. To illustrate, say Alice has given out two records containing some of her 
private information: Record 1 gives Alice’s name, phone number and credit card 
number; record 2 gives Alice’s name, phone and national identity number. How 
much information has actually “leaked” depends on how well an adversary, Bob,
can piece together these two records. If Bob can determine that the records refer 
to the same person, then he knows Alice’s credit card number and her national 
identity number, opening the door for, say, identity theft. If the records do not snap
together, then Bob knows less and we have a smaller information leak. We need to 
develop good ways to model information leakage in an entity resolution context. 
Such a model can lead us, for example, to techniques for quantifying the leakage 
caused by releasing one new fact, or the decrease in leakage caused by releasing 
disinformation. 
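
To make the metrics question concrete, one widely used way to compare two resolution schemes is pairwise precision and recall against a hand-labeled reference result. This is a generic evaluation technique offered as a sketch, not a metric proposed in the talk; it assumes a gold-standard clustering is available, and the record identifiers are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set, Tuple


def co_referent_pairs(clusters: Iterable[Iterable[str]]) -> Set[FrozenSet[str]]:
    """All unordered pairs of record ids that a clustering assigns to the same entity."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(set(c)), 2)}


def pairwise_precision_recall(predicted, gold) -> Tuple[float, float]:
    """Precision: share of predicted pairs that are correct. Recall: share of true pairs found."""
    pred, ref = co_referent_pairs(predicted), co_referent_pairs(gold)
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 1.0
    recall = correct / len(ref) if ref else 1.0
    return precision, recall
```

A cautious scheme that merges too little scores high on precision and low on recall, while an aggressive scheme shows the opposite pattern, so the two numbers together say more than either one alone.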

Additional information on our SERF project can be found at 
http://www-db.stanford.edu/serf 

This work is joint with Qi Su, Tyson Condie, Nicolas Pombourcq, and Jennifer Widom. 
Abstract. The envisioned Semantic Web aims to provide richly annotated and
explicitly structured Web pages in XML, RDF, or description logics, based upon 
underlying ontologies and thesauri. Ideally, this should enable a wealth of query 
processing and semantic reasoning capabilities using XQuery and logical infer- 
ence engines. However, we believe that the diversity and uncertainty of termi- 
nologies and schema-like annotations will make precise querying on a Web scale 
extremely elusive if not hopeless, and the same argument holds for large-scale 
dynamic federations of Deep Web sources. Therefore, ontology-based reasoning 
and querying needs to be enhanced by statistical means, leading to relevance- 
ranked lists as query results. 

This paper presents steps towards such a “statistically semantic” Web and outlines 
technical challenges. We discuss how statistically quantified ontological relations 
can be exploited in XML retrieval, how statistics can help in making Web-scale 
search efficient, and how statistical information extracted from users’ query logs
and click streams can be leveraged for better search result ranking. We believe 
these are decisive issues for improving the quality of next-generation search en- 
gines for intranets, digital libraries, and the Web, and they are crucial also for 
peer-to-peer collaborative Web search. 



1 The Challenge of “Semantic” Information Search 

The age of information explosion poses tremendous challenges regarding the intelligent 
organization of data and the effective search of relevant information in business and in- 
dustry (e.g., market analyses, logistic chains), society (e.g., health care), and virtually 
all sciences that are more and more data-driven (e.g., gene expression data analyses and 
other areas of bioinformatics). The problems arise in intranets of large organizations, in 
federations of digital libraries and other information sources, and in the most humon- 
gous and amorphous of all data collections, the World Wide Web and its underlying 
numerous databases that reside behind portal pages. The Web bears the potential of 
being the world’s largest encyclopedia and knowledge base, but we are very far from 
being able to exploit this potential. 

Gerhard Weikum et al.
P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 3-17, 2004.
© Springer-Verlag Berlin Heidelberg 2004

Database-system and search-engine technologies provide support for organizing
and querying information; but all too often they require excessive manual preprocessing,
such as designing a schema and cleaning raw data or manually classifying documents
into a taxonomy for a good Web portal, or manual postprocessing such as browsing
through large result lists with too many irrelevant items or surfing in the vicinity
of promising but not truly satisfactory approximate matches. The following are a few
example queries where current Web and intranet search engines fall short or where data
integration techniques and the use of SQL-like querying face unsurmountable difficulties
even on structured, but federated and highly heterogeneous databases:

Q1: Which professors from Saarbruecken in Germany teach information retrieval and
do research on XML? 

Q2: Which gene expression data from Barrett tissue in the esophagus exhibit high lev- 
els of gene AOlg? And are there any metabolic models for acid reflux that could 
be related to the gene expression data? 

Q3: What are the most important research results on large deviation theory? 

Q4: Which drama has a scene in which a woman makes a prophecy to a Scottish no- 
bleman that he will become king? 

Q5: Who was the French woman that I met in a program committee meeting where 
Paolo Atzeni was the PC chair? 

Q6: Are there any published theorems that are equivalent to or subsume my latest math- 
ematical conjecture? 

Why are these queries difficult (too difficult for Google-style keyword search unless 
one invests a huge amount of time to manually explore large result lists with mostly irrelevant
and some mediocre matches)? For Q1 no single Web site is a good match; rather
one has to look at several pages together within some bounded context: the homepage of 
a professor with his address, a page with course information linked to by the homepage, 
and a research project page on semistructured data management that is a few hyper- 
links away from the homepage. Q2 would be easy if asked for a single bioinformatics 
database with a familiar query interface, but searching the answer across the entire Web 
and Deep Web requires discovering all relevant data sources and unifying their query 
and result representations on the fly. Q3 is not a query in the traditional sense, but re- 
quires gathering a substantial number of key resources with valuable information on the 
given topic; it would be best served by looking up a well maintained Yahoo-style topic 
directory, but highly specific expert topics are not covered there. Q4 cannot be easily 
answered because a good match does not necessarily contain the keywords “woman”, 
“prophecy”, “nobleman”, etc., but may rather say something like “Third witch: All hail, 
Macbeth, thou shalt be king hereafter!” and the same document may contain the text 
“All hail, Macbeth! hail to thee, thane of Glamis!”. So this query requires some back- 
ground knowledge to recognize that a witch is a woman, “shalt be” refers to a prophecy, 
and thane is a title for a Scottish nobleman. Q5 is similar to Q4 in the sense that it also 
requires background knowledge, but it is more difficult because it additionally requires 
putting together various information fragments: conferences on which I served on the 
PC found in my email archive, PC members of conferences found on Web pages, and 
detailed information found on researchers’ homepages. And after having identified a 
candidate like Sophie Cluet from Paris, one needs to infer that Sophie is a typical fe- 
male first name and that Paris most likely denotes the capital of France rather than the 
500-inhabitants town of Paris, Texas, that became known through a movie. Q6 finally
is what some researchers call “AI-complete”; it will remain a challenge for a long time.

For a human expert who is familiar with the corresponding topics, none of these 
queries is really difficult. With unlimited time, the expert could easily identify rele- 
vant pages and combine semantically related information units into query answers. The 
challenge is to automate or simulate these intellectual capabilities and implement them 
so that they can handle billions of Web pages and petabytes of data in structured (but 
schematically highly diverse) Deep-Web databases. 




Towards a Statistically Semantic Web 






2 The Need for Statistics 

What if all Web pages and all Web-accessible data sources were in XML, RDF, or OWL 
(a description-logic representation) as envisioned in the Semantic Web research direction
[25, 1]? Would this enable a search engine to effectively answer the challenging
queries of the previous section? And would such an approach scale to billions of Web 
pages and be efficient enough for interactive use? Or could we even load and integrate 
all Web data into one gigantic database and use XQuery for searching it? 

XML, RDF, and OWL offer ways of more explicitly structuring and richly annotat- 
ing Web pages. When viewed as logic formulas or labeled graphs, we may think of the 
pages as having “semantics”, at least in terms of model theory or graph isomorphisms 1 . 
In principle, this opens up a wealth of precise querying and logical inferencing op- 
portunities. However, it is extremely unlikely that all pages will use the very same tag 
or predicate names when they refer to the same semantic properties and relationships. 
Making such an assumption would be equivalent to assuming a single global schema: 
this would be arbitrarily difficult to achieve in a large intranet, and it is completely
hopeless for billions of Web pages given the Web’s high dynamics, extreme diversity 
of terminology, and uncertainty of natural language (even if used only for naming tags 
and predicates). There may be standards (e.g., XML schemas) for certain areas (e.g., 
for invoices or invoice-processing Web Services), but these will have limited scope
and influence. A terminologically unified and logically consistent Semantic Web with 
billions of pages is hard to imagine. 

So reasoning about diversely annotated pages is a necessity and a challenge. Simi- 
larly to the ample research on database schema integration and instance matching (see, 
e.g., [49] and the references given there), knowledge bases [50], lexicons, thesauri [24], 
or ontologies [58] are considered as the key asset to this end. Here an ontology is un- 
derstood as a collection of concepts with various semantic relationships among them; 
the formal representation may vary from rigorous logics to natural language. The most 
important relationship types are hyponymy (specialization into narrower concepts) and 
hypernymy (generalization into broader concepts). 

1 Some people may argue that all computer models are mere syntax anyway, but this is in the
eye of the beholder.

To the best of my knowledge, the most comprehensive, publicly available kind of
ontology is the WordNet thesaurus hand-crafted by cognitive scientists at Princeton
[24]. For the concept “woman” WordNet lists about 50 immediate hyponyms, which
include concepts like “witch” and “lady” which could help to answer queries like Q4
from the previous section. However, regardless of whether one represents these hyponymy
relationships in a graph-oriented form or as logical formulas, such a rigid “true-or-false”
representation could never discriminate these relevant concepts from the other
48 irrelevant and largely exotic hyponyms of “woman”. In information-retrieval (IR)
jargon, such an approach would be called Boolean retrieval or Boolean reasoning; and
IR almost always favors ranked retrieval with some quantitative relevance assessment.
In fact, by simply looking at statistical correlations of using words like “woman” and
“lady” together in some text neighborhood within large corpora (e.g., the Web or large
digital libraries) one can infer that these two concepts are strongly related, as opposed
to concepts like “woman” and “siren”. Similarly, mere statistics strongly suggests that
a city name “Paris” denotes the French capital and not Paris, Texas. Once making a
distinction of strong vs. weak relationships and realizing that this is a full spectrum, it
becomes evident that the significance of semantic relationships needs to be quantified
in some manner, and the by far best known way of doing this (in terms of rigorous
foundation and rich body of results) is by using probability theory and statistics.
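
The co-occurrence argument above can be illustrated with a small sketch. The Dice coefficient used here is just one simple correlation measure chosen for the example; the text does not prescribe a particular one, and the toy corpus is made up.

```python
from typing import Iterable, List, Set


def dice_similarity(term_a: str, term_b: str, windows: Iterable[Set[str]]) -> float:
    """Dice coefficient of two terms over a collection of text windows (sets of words)."""
    n_a = n_b = n_ab = 0
    for w in windows:
        has_a, has_b = term_a in w, term_b in w
        n_a += has_a
        n_b += has_b
        n_ab += has_a and has_b
    return 2 * n_ab / (n_a + n_b) if n_a + n_b else 0.0


# Toy corpus (invented for illustration); each sentence is treated as one text window.
corpus = [
    "the woman greeted the lady at the door",
    "a lady and a woman shared a table",
    "the woman wrote a letter",
    "the siren warned the ships of the storm",
]
windows: List[Set[str]] = [set(s.split()) for s in corpus]

woman_lady = dice_similarity("woman", "lady", windows)
woman_siren = dice_similarity("woman", "siren", windows)
```

On this toy data "woman" and "lady" come out as strongly related while "woman" and "siren" do not co-occur at all; on a real corpus the same computation yields a graded value for every concept pair rather than a true-or-false link.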

This concludes my argument for the necessity of a “statistically semantic” Web. The 
following sections substantiate and illustrate this point by sketching various technical 
issues where statistical reasoning is key. Most of the discussion addresses how to handle 
non-schematic XML data; this is certainly still a good distance from the Semantic Web 
vision, but it is a decent and practically most relevant first step. 



3 Towards More “Semantics” in Searching XML and Web Data 

Non-schematic XML data that comes from many different sources and inevitably exhibits
heterogeneous structures and annotations (i.e., XML tags) cannot be adequately
searched using database query languages like XPath or XQuery. Often, queries either 
return too many or too few results. Rather the ranked-retrieval paradigm is called for, 
with relaxable search conditions, various forms of similarity predicates on tags and con- 
tents, and quantitative relevance scoring. Note that the need for ranking goes beyond 
adding Boolean text-search predicates to XQuery. In fact, similarity scoring and rank- 
ing are orthogonal to data types and would be desirable and beneficial also on structured 
attributes such as time (e.g., approximately in the year 1790), geographic coordinates 
(e.g., near Paris), and other numerical and categorical data types (e.g., numerical sensor 
readings and music style categories). 
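
As an illustration of a similarity predicate on a structured attribute, the following sketch scores how well a year satisfies a condition like "approximately in the year 1790". The exponential decay and the `scale` parameter are assumptions chosen for the example; they are not taken from any of the query languages discussed here.

```python
import math
from typing import List, Tuple


def approx_year_score(year: int, target: int, scale: float = 10.0) -> float:
    """Similarity in (0, 1]: 1.0 for an exact hit, decaying exponentially with the distance in years."""
    return math.exp(-abs(year - target) / scale)


def rank_by_year(items: List[Tuple[str, int]], target: int) -> List[Tuple[str, float]]:
    """Return (label, score) pairs in descending order of similarity instead of a yes/no filter."""
    scored = [(label, approx_year_score(year, target)) for label, year in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The point of the design is that a near miss (say, 1791) still appears in the result list with a slightly lower score, whereas a Boolean predicate would either drop it or treat it the same as an exact match.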

Research on applying IR techniques to XML data started five years ago with the
work [26, 55, 56, 60] and has meanwhile gained considerable attention. This research 
avenue includes approaches based on combining ranked text search with XPath-style 
conditions [4,13,35,11,31,38], structural similarities such as tree-editing distances 
[5,54,69, 14], ontology-enhanced content similarities [60,61,52], and applying proba- 
bilistic IR and statistical language models to XML [28, 2]. 

Our own approach, the XXL 2 query language and search engine [60, 61, 52], com- 
bines a subset of XPath with a similarity operator ~ that can be applied to element or 
attribute names, on one hand, and element or attribute contents, on the other hand. For 
example, the queries Q1 and Q4 of Section 1 could be expressed in XXL as follows 
(and executed on a heterogeneous collection of XML documents): 

Q1: Select * From Index
    Where ~professor As P
    And P = "~Saarbruecken"
    And P//~course = "~IR"
    And P//~research = "~XML"

Q4: Select * From Index
    Where ~drama//scene As S
    And S//~speaker = "~woman"
    And S//~speech = "king"
    And S//~person = "~nobleman"

2 Flexible XML Search Language.

Here XML data is interpreted as a directed graph, including href or XLink/XPointer
links within and across documents that go beyond a merely tree-oriented approach. End
nodes of connections that match a path condition such as drama//scene are bound
to node variables that can be referred to in other search conditions. Content conditions
such as = "~woman" are interpreted as keyword queries on XML elements, using
IR-style measures (based on statistics like term frequencies and inverse element frequencies)
for scoring the relevance of an element. In addition and most importantly, we
allow expanding the query by adding “semantically” related terms taken from an ontology.
In the example, “woman” could be expanded into “woman wife lady girl witch
...”. The score of a relaxed match, say for an element containing “witch”, is the product
of the traditional score for the query “witch” and the ontological similarity of the
query term and the related term, sim(woman, witch) in the particular example. Element
(or attribute) name conditions such as ~course are analogously relaxed, so that,
for example, tag names “teaching”, “class”, or “seminar” would be considered as approximate
matches. Here the score is simply the ontological similarity, for tag names
are only single words or short composite words. The result of an entire query is a ranked
list of subgraphs of the XML data graph, where each result approximately matches all
query conditions with the same binding of all variables (but different results have different
bindings). The total score of a result is computed from the scores of the elementary
conditions using a simple probabilistic model with independence assumptions, and the
result ranking is in descending order of total scores.
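
The scoring scheme just described can be sketched as follows. This is not the XXL implementation: the similarity values are invented, the relative-term-frequency scorer is a stand-in for a real IR measure, taking the best expansion term is an assumption, and the plain product is only one reading of "a simple probabilistic model with independence assumptions".

```python
from math import prod
from typing import Dict, Iterable

# Hypothetical ontological similarities sim(query term, related term); values are made up.
ONTOLOGY_SIM: Dict[str, Dict[str, float]] = {
    "woman": {"woman": 1.0, "lady": 0.9, "witch": 0.5},
    "king": {"king": 1.0, "monarch": 0.8},
}


def content_score(term: str, element_text: str) -> float:
    """Stand-in for an IR-style relevance score in [0, 1]; here simply relative term frequency."""
    words = element_text.lower().split()
    return words.count(term) / len(words) if words else 0.0


def relaxed_score(query_term: str, element_text: str) -> float:
    """Best expansion: traditional score of the related term times its ontological similarity."""
    expansions = ONTOLOGY_SIM.get(query_term, {query_term: 1.0})
    return max(content_score(t, element_text) * s for t, s in expansions.items())


def total_score(condition_scores: Iterable[float]) -> float:
    """Combine elementary condition scores as independent probabilities (assumed: plain product)."""
    return prod(condition_scores)
```

For an element whose text is "Third witch", the condition on "woman" is satisfied only through the related term "witch", so its score is the content score for "witch" scaled down by sim(woman, witch); an exact occurrence of "woman" would not be scaled down.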

Query languages of this kind work nicely on heterogeneous and non-schematic 
XML data collections, but the Web and also large fractions of intranets are still mostly in 
HTML, PDF, and other less structured formats. Recently we have started to apply XXL- 
style queries also to such data by automatically converting Web data into XML format. 
The COMPASS 3 search engine that we have been building supports XML ranked re- 
trieval on the full suite of Web and intranet data including combined data collections 
that include both XML documents and Web pages [32]. For example, query Q1 can be 
executed on an index that is built over all of DBLP (cast into XML) and the crawled 
homepages of all authors and other Web pages reachable through hyperlinks. Figure 1 
depicts the visual formulation of query Q1. As in the original XXL engine, conditions
with the similarity operator ~ are relaxed using statistically quantified relationships 
from the ontology. 



Fig. 1. Visual COMPASS Query
The conversion of HTML and other formats into XML is based on relatively simple
heuristic rules, for example, casting HTML headings into XML element names. For 
additional automatic annotation we use the information extraction component ANNIE 
that is part of the GATE System developed at the University of Sheffield [20]. GATE 
offers various modules for analyzing, extracting, and annotating text; its capabilities 
range from part-of-speech tagging (e.g., for noun phrases, temporal adverbial phrases, 
etc.) and lexicon lookups (e.g., for geographic names) to finite state transducers for an- 
notations based on regular expressions (e.g., for dates or currency amounts). One par- 
ticularly useful and fairly light-weight component is the Gazetteer Module for named 
entity recognition based on part-of-speech tagging and a large dictionary containing 
names of cities, countries, person names (e.g., common first names), etc. This way one 
can automatically generate tags like <location> and <person>. For example, we 
were able to annotate the popular Wikipedia open encyclopedia corpus this way, generating
about 2 million person and location tags. And this is the key for more advanced
“semantics-aware” search on the current Web. For example, searching for Web pages 
about the physicist Max Planck would be phrased as person = "Max Planck", 
and this would eliminate many spurious matches that a Google-style keyword query 
“Max Planck” would yield about Max Planck Institutes and the Max Planck Society 4 . 
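
The idea of dictionary-driven annotation can be sketched without any particular toolkit. The code below does not use GATE or ANNIE and does not reproduce their behavior; the three-entry gazetteer and the `annotate` helper are hypothetical, and a real system would combine the lookup with part-of-speech tagging as described above.

```python
import re
from typing import Dict

# Tiny hypothetical gazetteer; a real one holds many thousands of entries.
GAZETTEER: Dict[str, str] = {
    "Max Planck": "person",
    "Paris": "location",
    "Saarbruecken": "location",
}


def annotate(text: str, gazetteer: Dict[str, str] = GAZETTEER) -> str:
    """Wrap every gazetteer entry found in the text in an XML tag named after its entity type."""
    # Longest names first so that multi-word entries win over shorter overlapping ones.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")

    def tag(m: "re.Match[str]") -> str:
        kind = gazetteer[m.group(1)]
        return f"<{kind}>{m.group(1)}</{kind}>"

    return pattern.sub(tag, text)


annotated = annotate("Max Planck lectured in Paris.")
```

A pure dictionary lookup like this would also tag the name inside "Max Planck Institutes" as a person, which is exactly the kind of ambiguity that additional linguistic context has to resolve.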

There is a rich body of research on information extraction from Web pages and 
wrapper generation. This ranges from purely logic-based or pattern-matching-driven 
approaches (e.g., [51, 17, 6, 30]) to techniques that employ statistical learning (e.g., Hid- 
den Markov Models) (e.g., [15, 16, 39,57,40]) to infer structure and annotations when 
there is too much diversity and uncertainty in the underlying data. As long as all pages 
to be wrapped come from the same data source (with some hidden schema), the logic- 
based approaches work very well. However, when one tries to wrap all homepages of 
DBLP authors or the course programs of all computer science departments in the world, 
uncertainty is inevitable and statistics-driven techniques are the only viable ones (unless 
one is willing to invest a lot of manual work for traditional schema integration, writing 
customized wrappers and mappers). 

Despite advertising our own work and mentioning our competitors, current research
on combining IR techniques and statistical learning with XML querying
is still in an early stage and there are certainly many open issues and opportunities
for further research. These include better theoretical foundations for scoring models 
on semistructured data, relevance feedback and interactive information search, and, of 
course, all kinds of efficiency and scalability aspects. Applying XML search techniques 
to Web data is in its infancy; studying what can be done with named-entity recog- 
nition and other automatic annotation techniques and understanding the interplay of 
queries with such statistics-based techniques for better information organization are 
wide open fields.



4 Statistically Quantified Ontologies 

The important role of ontologies in making information search more “semantics-aware”
has already been emphasized. In contrast to most ongoing efforts for Semantic-Web ontologies,
our work has focused on quantifying the strengths of semantic relationships
based on corpus statistics [52, 59] (see also the related work [10, 44, 22, 36] and further
references given there). In contrast to early IR work on using thesauri for query expansion
(e.g., [64]), the ontology itself plays a much more prominent role in our approach
with carefully quantified statistical similarities among concepts.

4 Germany's premier scientific society, which encompasses 80 institutes in all fields of science.

Consider a graph of concepts, each characterized by a set of synonyms and, op- 
tionally, a short textual description, connected by “typed” edges that represent different 
kinds of relationships: hypernyms and hyponyms (generalization and specialization, 
a.k.a. is-a relations), holonyms and meronyms (part-of relations), is-instance-of relations
(e.g., Cinderella being an instance of a fairytale or IBM Thinkpad being a notebook), 
to name the most important ones. 

The first step in building an ontology is to create the nodes and edges. To this end, 
existing thesauri, lexicons, and other sources like geographic gazetteers (for names of 
countries, cities, rivers, etc. and their relationships) can be used. In our work we made 
use of the WordNet thesaurus [24] and the Alexandria Digital Library Gazetteer [3], and 
also started extracting concepts from page titles and href anchor texts in the Wikipedia 
encyclopedia. One of the shortcomings of WordNet is its lack of instance knowledge,
for example, brand names and models of cars, cameras, computers, etc. To further en- 
hance the ontology, we crawled Web pages with HTML tables and forms, trying to 
extract relationships between table-header column and form-field names and the values 
in table cells and the pulldown menus of form fields. Such approaches are described 
in the literature (see, e.g., [21, 63, 68]). Our experimental findings confirmed the poten- 
tial value of these techniques, but also taught us that careful statistical thresholding is 
needed to eliminate noise and incorrect inferencing, once again a strong argument for 
the use of statistics. 

Once the concepts and relationships of a graph-based ontology are constructed, the 
next step is to quantify the strengths of semantic relationships based on corpus statistics. 
To this end we performed focused Web crawls and used their results to estimate
statistical correlations between the characteristic words of related concepts. One of the 
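measures for the similarity of concepts c1 and c2 that we used is the Dice coefficient

$$\mathrm{Dice}(c_1, c_2) = \frac{2\,\bigl|\{\text{docs with } c_1\} \cap \{\text{docs with } c_2\}\bigr|}{\bigl|\{\text{docs with } c_1\}\bigr| + \bigl|\{\text{docs with } c_2\}\bigr|}$$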

In this computation we represent concept c by the terms taken from its set of syn- 
onyms and its short textual description (i.e., the WordNet gloss). Optionally, we can add 
terms from neighbors or siblings in the ontological graph. A document in the corpus is 
considered to contain concept c if it contains at least one word of the term set for c, 
and considered to contain both c1 and c2 if it contains at least one word from each of
the two term sets. This is a heuristic; other approaches are conceivable, and we are
investigating them.
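The containment heuristic and the Dice coefficient described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the term sets, and the function names `contains_concept` and `dice` are made up here and are not taken from the ontology service itself.

```python
def contains_concept(doc_words, term_set):
    """A document contains a concept if it shares at least one word
    with the concept's term set (synonyms plus gloss terms)."""
    return not doc_words.isdisjoint(term_set)


def dice(corpus, terms_c1, terms_c2):
    """Dice coefficient of two concepts over a corpus of tokenized documents."""
    docs = [set(doc) for doc in corpus]
    n1 = sum(contains_concept(d, terms_c1) for d in docs)
    n2 = sum(contains_concept(d, terms_c2) for d in docs)
    both = sum(
        contains_concept(d, terms_c1) and contains_concept(d, terms_c2)
        for d in docs
    )
    return 2 * both / (n1 + n2) if n1 + n2 else 0.0


# Hypothetical toy corpus and term sets, for illustration only.
corpus = [
    ["the", "woman", "spoke"],
    ["his", "wife", "and", "a", "lady"],
    ["a", "notebook", "computer"],
    ["the", "lady", "left"],
]
woman = {"woman", "lady"}
wife = {"wife", "spouse"}
print(dice(corpus, woman, wife))  # 3 docs with c1, 1 with c2, 1 with both -> 0.5
```

On a real crawl the document sets would come from an inverted index rather than a linear scan, but the arithmetic stays the same.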

Following this methodology, we constructed an ontology service [59] that is accessible
via Java RMI or as a SOAP-based Web Service described in WSDL. The service is
used in the COMPASS search engine [32], but also in other projects. Figure 2 shows a 
screenshot from our ontology visualization tool. 

Fig. 2. Ontology Visualization

One of the difficulties in quantifying ontological relationships is that we aim to
measure correlations between concepts but merely have statistical information about
correlations between words. Ideally, we should first map the words in the corpus onto
the corresponding concepts, i.e., their correct meanings. This is known as the word 
sense disambiguation problem in natural language processing [45], obviously a very 
difficult task because of polysemy. If this were solved it would not only help in deriv- 
ing more accurate statistical measures for “semantic” similarities among concepts but 
could also potentially boost the quality of search results and automatic classification of 
documents into topic directories. Our work [59] presents a simple but scalable approach 
to automatically mapping text terms onto ontological concepts, in the context of XML 
document classification. Again, statistical reasoning, in combination with some degree 
of natural language parsing, is key to tackling this difficult problem. 

Ontology construction is a highly relevant research issue. Compared to the ample 
work on knowledge representations for ontological information, the aspects of how to 
“populate” an ontology and how to enhance it with quantitative similarity measures 
have been underrated and deserve more intensive research. 

5 Efficient Top-k Query Processing with Probabilistic Pruning 

For ranked retrieval of semistructured, “semantically” annotated data, we face the prob- 
lem of reconciling efficiency with result quality. Usually, we are not interested in a 
complete result but only in the top-k results with the highest relevance scores. The 
state-of-the-art algorithm for top-k queries on multiple index lists, each sorted in de- 
scending order of relevance scores, is the Threshold Algorithm, TA for short [23,33, 
47]. It is applicable to both relational data such as product catalogs and text documents 
such as Web data. In the latter case, the fact that TA performs random accesses on very 
long, disk-resident index lists (e.g., all URLs or document ids for a frequently occurring 
word), with only short prefixes of the lists in memory, makes TA much less attractive, 
however.

In such a situation, the TA variant with sorted access only, coined NRA (no random
accesses), stream-combine, or TA-sorted in the literature, is the method of choice [23, 
34]. TA-sorted works by maintaining lower bounds and upper bounds for the scores of 
the top-k candidates that are kept in a priority queue in memory while scanning the in- 
dex lists. The algorithm can safely stop when the lower bound for the score of the rank-k 
result is at least as high as the highest upper bound for the scores of the candidates that 
are not among the current top-k. Unfortunately, albeit theoretically instance-optimal for 
computing a precise top-k result [23], TA-sorted tends to degrade in performance when 
operating on a large number of index lists. This is exactly the case when we relax query 
conditions such as ~speaker = ~woman using semantically related concepts from
the ontology 5 . Even if the relaxation uses a threshold for the similarity of related con- 
cepts, we may often arrive at query conditions with 20 to 50 search terms. 
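To make the bookkeeping of TA-sorted concrete, the following is a minimal Python sketch of the sorted-access-only scheme described above. It assumes that index lists are in-memory lists of (document, score) pairs sorted by descending score, that scores are aggregated by summation, and that lists are scanned round-robin; the function name `ta_sorted` and the toy data are hypothetical, and the published algorithms [23, 34] include refinements that are omitted here.

```python
def ta_sorted(index_lists, k):
    """Top-k by summed score using sorted accesses only.

    index_lists: lists of (doc, score) pairs, each sorted by descending score.
    Returns the top-k documents with their lower-bound scores, best first.
    """
    m = len(index_lists)
    seen = {}  # doc -> {list index: score seen in that list}
    high = [lst[0][1] if lst else 0.0 for lst in index_lists]
    max_depth = max((len(lst) for lst in index_lists), default=0)
    depth = 0
    while depth < max_depth:
        # One round of sorted accesses: read the next entry of every list.
        for j, lst in enumerate(index_lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[j] = score
                high[j] = score
            else:
                high[j] = 0.0  # list exhausted, nothing more to gain from it
        depth += 1

        lower = {d: sum(s.values()) for d, s in seen.items()}
        upper = {
            d: lower[d] + sum(high[j] for j in range(m) if j not in s)
            for d, s in seen.items()
        }
        ranked = sorted(seen, key=lambda d: lower[d], reverse=True)
        topk, rest = ranked[:k], ranked[k:]
        if len(topk) < k:
            continue
        min_k = lower[topk[-1]]
        # Best score reachable outside the current top-k, including
        # documents that have not been seen in any list yet.
        best_other = max([upper[d] for d in rest] + [sum(high)])
        if min_k >= best_other:
            break  # safe to stop: nobody can overtake the rank-k result

    ranked = sorted(seen, key=lambda d: sum(seen[d].values()), reverse=True)
    return [(d, sum(seen[d].values())) for d in ranked[:k]]


# Hypothetical index lists for three query terms.
lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.1)],
    [("b", 0.7), ("a", 0.6), ("d", 0.2)],
    [("a", 0.5), ("c", 0.4), ("b", 0.3)],
]
print([doc for doc, _ in ta_sorted(lists, 2)])  # ['a', 'b']
```

With 20 to 50 lists, the sum of the `high` values shrinks slowly, which is one way to see why the stopping test fires late for relaxed queries.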

Statistics about the score distributions in the various index lists and some probabilistic
reasoning help to overcome this efficiency problem and re-gain performance. In TA-sorted
a top-k candidate d that has already been seen in the index lists in E(d) ⊆ [1..m],
achieving score s_j(d) in list j (0 < s_j(d) ≤ 1), and has unknown scores in the index
lists [1..m] - E(d), satisfies:

$$\mathrm{lowerb}(d) = \sum_{j \in E(d)} s_j(d) \;\le\; s(d) \;\le\; \sum_{j \in E(d)} s_j(d) + \sum_{j \notin E(d)} \mathit{high}_j = \mathrm{upperb}(d)$$

where s(d) denotes the total, but not yet known, score that d achieves by summing
up the scores from all index lists in which d occurs, lowerb(d) and upperb(d) are the
lower and upper bounds of d’s score, and high_j is the score that was last seen in the
scan of index list j, upper-bounding the score that any candidate may obtain in list j.
A candidate d remains a candidate as long as upperb(d) > lowerb(rank-k) where
rank-k is the candidate that currently has rank k with regard to the candidates’ lower
bounds (i.e., the worst one among the current top-k). Assuming that d can achieve a
score high_j in all lists in which it has not yet been encountered is conservative and,
almost always, overly conservative. Rather we could treat these unknown scores as
random variables S_j (j ∉ E(d)), and estimate the probability that d’s total score can
exceed lowerb(rank-k). Then d is discarded from the candidate list if

$$P\Bigl[\mathrm{lowerb}(d) + \sum_{j \notin E(d)} S_j > \mathrm{lowerb}(\text{rank-}k)\Bigr] < \delta$$

with some pruning threshold δ.

This probabilistic interpretation makes some small, but precisely quantifiable, po- 
tential error in that it could dismiss some candidates too early. Thus, the top-k result 
computed this way is only approximate. However, the loss in precision and recall, rel- 
ative to the exact top-k result using the same index lists, is stochastically bounded and 
can be set according to the application’s needs. A value of δ = 0.1 seems to be acceptable
in most situations. Technically, the approach requires computing the convolution
of the random variables S_j, based on assumed distributions (with parameter fitting) or
precomputed histograms for the individual index lists and taking into account the current
high_j values, and predicting the (1-δ)-quantile of the sum’s distribution. Details
of the underlying mathematics and the implementation techniques for this Prob-sorted
method can be found in [62]. Experiments with the TREC-12 .Gov corpus and the
IMDB data collection have shown that such a probabilistic top-k method gains about a
factor of ten (and sometimes more) in run-time compared to TA-sorted.

5 Note that the TA and TA-sorted algorithms can be easily modified to handle both element-name
and element-contents conditions (as opposed to mere keyword sets in standard IR and
Web search engines).
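The pruning test can be illustrated with a small Python sketch. It is a simplified stand-in for the method in [62], not a reproduction of it: score distributions are assumed to be discrete dictionaries mapping a score to its probability (a crude histogram that includes score 0 for documents absent from a list), the unknown scores are assumed independent, and each distribution is conditioned on the current high value of its list. All function names and numbers are hypothetical.

```python
from collections import defaultdict


def truncate(dist, high):
    """Condition a discrete score distribution {score: prob} on score <= high.
    Assumes dist contains score 0.0, so the result is never empty."""
    kept = {s: p for s, p in dist.items() if s <= high}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}


def convolve(dist_a, dist_b):
    """Distribution of the sum of two independent discrete random variables."""
    out = defaultdict(float)
    for sa, pa in dist_a.items():
        for sb, pb in dist_b.items():
            out[round(sa + sb, 6)] += pa * pb
    return dict(out)


def exceed_probability(lowerb_d, unseen, min_k):
    """P[lowerb(d) + sum of unknown scores > lowerb(rank-k)].

    unseen: (distribution, high) pairs for the lists in which d is still unseen.
    """
    total = {0.0: 1.0}
    for dist, high in unseen:
        total = convolve(total, truncate(dist, high))
    return sum(p for s, p in total.items() if lowerb_d + s > min_k)


def should_prune(lowerb_d, unseen, min_k, delta=0.1):
    """Discard candidate d if its chance of reaching the top-k is below delta."""
    return exceed_probability(lowerb_d, unseen, min_k) < delta


# Hypothetical histogram shared by two lists: most documents score 0 there.
dist = {0.0: 0.8, 0.2: 0.1, 0.5: 0.07, 0.9: 0.03}
unseen = [(dist, 0.5), (dist, 0.5)]
# upperb(d) = 0.6 + 0.5 + 0.5 = 1.6 > 1.5, so TA-sorted would keep d,
# but reaching 1.5 requires the score 0.5 in both lists, which is unlikely.
print(should_prune(0.6, unseen, 1.5))
```

Checking the tail probability against δ is equivalent to comparing the (1-δ)-quantile of the sum with the missing score mass; the sketch uses the tail form because it is shorter to write.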

The outlined algorithm for approximate top-k queries with probabilistic guarantees 
is a versatile building block for XML ranked retrieval. In combination with ontology- 
based query relaxation, for example, expanding ~woman into (woman or wife 
or witch), it can add index lists dynamically and incrementally, rather than having
to expand the query upfront based on thresholds. To this end, the algorithm considers
the ontological similarity sim(i,j) between concept i from the original query and
concept j in the relaxed query, and multiplies it with the high_j value of index list j
to obtain an upper bound for the score (and characterize the score distribution) that a 
candidate can obtain from the relaxation j. This information is dynamically combined 
with the probabilistic prediction of the other unknown scores and their sum. 
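The upper bound for a relaxation is simply the product of the ontological similarity and the current high value of the corresponding index list. The sketch below computes these bounds and, as a purely hypothetical policy that is not taken from the paper, opens the not-yet-scanned list with the largest bound next. The similarity and high values are illustrative.

```python
def relaxation_upper_bounds(similarities, highs):
    """Upper bound sim(i, j) * high_j for the score obtainable from relaxation j."""
    return {j: similarities[j] * highs[j] for j in similarities}


def next_list_to_open(similarities, highs, opened):
    """Hypothetical policy: open the unopened list with the largest bound."""
    bounds = relaxation_upper_bounds(similarities, highs)
    candidates = {j: b for j, b in bounds.items() if j not in opened}
    return max(candidates, key=candidates.get) if candidates else None


# Illustrative similarities to the query concept "woman" and current high values.
sims = {"woman": 1.0, "wife": 0.8, "witch": 0.5}
highs = {"woman": 0.4, "wife": 0.9, "witch": 0.9}
print(next_list_to_open(sims, highs, opened={"woman"}))  # wife
```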

The algorithm can also be combined with distance-aware path indexes for XML data 
(e.g., the HOPI index structure [53]). This is required when queries contain element- 
name and element-contents conditions as well as path conditions of the form 
professor//course where matches for “course” that are close to matches for
“professor” should be ranked higher than matches that are far apart. Thus, the Prob- 
sorted algorithm covers a large fraction of an XML ranked retrieval engine. 

6 Exploiting Collective Human Input 

The statistical information considered so far refers to data (e.g., scores in index lists) 
or metadata (e.g., ontological similarities). Yet another kind of statistics is information 
about user behavior. This could include relatively static properties like bookmarks or 
embedded hyperlinks pointing to high-quality Web pages, but also dynamic properties 
inferred from query logs and click streams. For example, Google’s PageRank views a 
Web page as more important if it has many incoming links and the sources of these 
links are themselves high authorities [9, 12]. Technically, this amounts to computing 
stationary probabilities for a Markov-chain model that mimics a “random surfer”. What 
PageRank essentially does is to exploit the intellectual endorsements that many human 
users (or Web administrators on behalf of organizations) provide by means of hyper- 
links. 
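The stationary probabilities of the random-surfer chain can be approximated by power iteration. The following Python sketch assumes the common formulation with a damping factor of 0.85 and uniform random jumps, neither of which is spelled out in the text above; the link graph and the function name `pagerank` are illustrative.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate stationary probabilities of a random surfer.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the surfer jumps to a random page.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page without out-links: jump to a random page.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank


# Toy graph: a, b and d endorse c; c endorses a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c
```

Page a ends up ranked above b and d although all three have the same out-links, because a is endorsed by the high-authority page c.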

This rationale can be carried over to analyzing and exploiting entire surf trails and 
query logs of individual users or an entire user community. These trails, which can 
be gathered from browser histories, local proxies, or Web servers, capture implicit user 
judgements. For example, suppose a user clicks on a specific subset of the top 10 results 
returned by a search engine for a query with several keywords, based on having seen 
the summaries of these pages. This implicit form of relevance feedback establishes a 
strong correlation between the query and the clicked-on pages. Further suppose that the 
user refines a query by adding or replacing keywords, e.g., to eliminate ambiguities in 
the previous query. Again, this establishes correlations between the new keywords and
the subsequently clicked-on pages, but also, albeit possibly to a lesser extent, between
the original query and the eventually relevant pages. 

We believe that observing and exploiting such user behavior is a key element in 
adding more “semantic” or “cognitive” quality to a search engine. The literature contains
some very interesting work in this direction (e.g., [19,65,67]), but it is rather preliminary
at this point. Perhaps the difficulty of obtaining comprehensive query logs
and surf trails outside of big service providers is a limiting factor in this line of experimental
research. Our own, very recent, work generalizes the notion of a “random
surfer” into a “random expert user” by enhancing the underlying Markov chain to in- 
corporate also query nodes and transitions from queries to query refinements as well as 
clicked-on documents. Transition probabilities are derived from the statistical analysis 
of query logs and click streams. The resulting Markov chain converges to stationary 
authority scores that reflect not only the link structure but also the implicit feedback 
and collective human input of a search engine’s users [43].

The de-facto monopoly that large Internet service providers have on being able to 
observe user behavior and statistically leverage this valuable information may be over- 
come by building next-generation Web search engines in a truly decentralized and ide- 
ally self-organized manner. Consider a peer-to-peer (P2P) system where each peer has a 
full-fledged Web search engine, including a crawler and an index manager. The crawler 
may be thematically focused or crawl results may be postprocessed so that the local 
index contents reflect the corresponding user’s interest profile. With such a highly specialized
and personalized “power search engine” most queries should be executed locally,
but once in a while the user may not be satisfied with the local results and would
then want to contact other peers. A “good” peer to which the user’s query should be 
forwarded would have thematically relevant index contents, which could be measured 
by statistical notions of similarity between peers. These measures may be dependent 
on the current query or may be query-independent; in the latter case, statistics is used 
to effectively construct a “semantic overlay network” with neighboring peers sharing 
thematic interests [8,42,48, 18,7,66]. Both query routing and “statistically semantic” 
networks could greatly benefit from collective human inputs in addition to standard IR 
measures like term and document frequencies or term-wise score distributions: know- 
ing the bookmarks and query logs of thousands of users would be a great resource to 
build on. 

Further exploring these considerations on P2P Web search should become a major 
research avenue in computer science. Note that our interpretation of Web search in- 
cludes ranked retrieval and thus is fundamentally more difficult than Gnutella-style file 
sharing or simple key lookups via distributed hash tables. Further note that, although 
query routing in P2P Web search resembles earlier work on metasearch engines and 
distributed IR (see, e.g., [46] and the references given there), it is much more challeng- 
ing because of the large scale and the high dynamics of the envisioned P2P system with 
thousands or millions of computers and users. 

7 Conclusion 

With the ongoing information explosion in all areas of business, science, and society,
it will be more and more difficult for humans to keep information organized and
extract valuable knowledge in a timely manner. The intellectual time for schema design,
schema integration, data cleaning, data quality assurance, manual classification,
directory and search result browsing, clever formulation of sophisticated queries, etc. 
is already the major bottleneck today, and the situation is likely to become worse. In 
my opinion, this will render all attempts to master Web-scale information in a perfectly 
consistent, purely logic-based manner more or less futile. Rather, the ability to cope 
with uncertainty, diversity, and high dynamics will be mandatory. To this end, statistics 
and their use in probabilistic inferences will be key assets. 

One may envision a rich probabilistic algebra that encompasses relational or even 
object-relational and XML query languages, but interprets all data and results in a probabilistic
manner and always produces ranked result lists rather than Boolean result
sets (or bags). There are certainly some elegant and interesting, but mostly theoretical, 
approaches along these lines (e.g., [27, 29, 37]). However, there is still a long way to go 
towards practically viable solutions. Among the key challenges that need to be tackled 
are customizability, composability, and optimizability. 

- Customizability: The appropriate notions of ontological relationships, “semantic” 
similarities, and scoring functions are dependent on the application. Thus, the envi- 
sioned framework needs to be highly flexible and adaptable to incorporate applica- 
tion-specific or personalized similarity and scoring models. 

- Composability: Algebraic building blocks like a top-k operator need to be com- 
posable so as to allow the construction of rich queries. The desired property that 
operators produce ranked lists with some underlying probability (or “score mass”)
distribution poses a major challenge, for we need to be able to infer these probabil- 
ity distributions for the results of complex operator trees. This problem is related 
to the difficult issues of selectivity estimation and approximate query processing in 
a relational database, but goes beyond the state of the art as it needs to incorporate 
text term distributions and has to yield full distributions at all levels of operator 
trees. 

- Optimizability: Regardless of how elegant a probabilistic query algebra may be, it 
would not be acceptable unless one can ensure efficient query processing. Perfor- 
mance optimization requires a deep understanding of rewriting complex operator 
trees into equivalent execution plans that have significantly lower cost (e.g., pushing 
selections below joins or choosing efficient join orders). At the same time, the top- 
k querying paradigm that avoids computing full result sets before applying some 
ranking is a must for efficiency, too. This combination of desiderata leads to a great 
research challenge in query optimization for a ranked retrieval algebra.

Abstract. This paper introduces the application of Business Intelligence
(BI) technologies in metallurgical manufacturing enterprises in
China. It sets forth the development procedure and successful cases of BI
in Shanghai Baoshan Iron & Steel Co., Ltd. (Shanghai Baosteel for short),
and puts forward a methodology suited to the construction of BI
systems in the metallurgical manufacturing enterprises in China. Finally,
it looks ahead to the next generation of BI technologies in Shanghai Baosteel.
It should also be mentioned that the Data Strategies Dept. of
Shanghai Baosight Software Co., Ltd. (Shanghai Baosight for short) and
the Technology Center of Shanghai Baoshan Iron & Steel Co., Ltd.
support and carry out research on BI solutions in Shanghai Baosteel.



1 Introduction 

1.1 The Application of BI Technologies in Metallurgical 
Manufacturing Enterprises in the World 

Enterprise executives are sometimes at a loss when faced with the explosively
growing data from application systems at different levels, such as MES, ERP,
CRM, and SCM. Statistics show that the amount of data doubles within eighteen
months. But how much of this data do we really need, and how much can we
actually use for further analysis? The main advantage of BI technologies is that
they turn these massive data into useful information for enterprise decision-making.

Research on and application of BI have become a hot topic in the global IT
community since the term BI was first put forward by Howard Dresner
of Gartner Group in 1989. Based on our years of practice, we consider BI a
concept rather than an information technology. It is a business concept for solving
problems in enterprise production, operation, management, etc.
Taking the enterprise data warehouse as its basis, BI uses professional
knowledge and special data mining technologies to disclose key factors in solving
business problems and to assist operational management and decision-making.

As the most advanced metallurgical manufacturing enterprise in China, Shanghai
Baosteel began to use BI technologies to solve some key problems in
daily production and management during the last decade. It has applied BI technologies
such as data analysis and data mining, both spontaneously and consciously,
since the development of the Solution of Iron Ores Mixing in 1995, followed by
the quality control system, SPC, IPC, and finally today's large-scale enterprise
data warehouse. In the meantime, Shanghai Baosight has formed its
own distinctive approach to applying BI in metallurgical manufacturing enterprises,
especially in the quality control area. In addition, Shanghai Baosight has built an
experienced professional team in system development and project management.
The following are some achievements in specific areas.

Data Warehouse: Considering its size, complexity and technical level, the
Shanghai Baosteel enterprise data warehouse system is a rare and advanced system
in China. As a successful BI case, this data warehouse system has become
a model in metallurgical manufacturing today.

Quality Control and Analysis: In this area, many advanced and distinctive data
mining techniques have been widely applied for
quality improvement, and they can be extended to other manufacturing enterprises
as well.

SPC and IPC: As the basis of quality control, SPC and IPC systems with
special characteristics are commonly used in Shanghai Baosteel. They
are suitable for other manufacturing enterprises as well.

The achievements in the above three areas show that Shanghai Baosteel is
a leader in BI application among metallurgical manufacturing enterprises in China.
With the transfer of this experience, other metallurgical manufacturing enterprises
will follow in the steps of Shanghai Baosteel, and Shanghai Baosight will go further
in the related BI application areas.

Compared with international peers such as POSCO and the United
States Steel Corporation (UEC), Shanghai Baosteel is also among the leaders in BI
application. UEC once invited Shanghai Baosteel to present its experience in
building a metallurgical manufacturing enterprise.



1.2 The Information System Development of Shanghai Baosteel 

Shanghai Baosteel is the largest and the most modernized iron and steel complex 
in China. Baosteel has established its status as a world steel-making giant with 
comprehensive advantages in its reputation, talents, innovation, management 
and technology. According to the publication “Guide to the World Steel Industry”,
Shanghai Baosteel ranks among the top three most competitive
steel-makers worldwide, and is also regarded as potentially the most competitive
iron and steel enterprise in the future.

Shanghai Baosteel specializes in producing high-tech and high-value-added
steel products. It has become the main steel supplier in China to the automobile,
household appliance, container, oil and natural gas exploration, and
pressure vessel industries. Shanghai Baosteel also exports its products to
over forty countries and regions including Japan, South Korea and countries in
Europe and America.

All the facilities that the company possesses are based on advanced technologies
in contemporary steel smelting, cold and hot processing, hydraulic sensing,
electronic control, computing and information communications. They feature
large scale, continuous operation and automation, and are kept at the most advanced
technological level in the world.

Shanghai Baosteel has great strength in research and development.
It has made great efforts in developing new technologies, new products
and new equipment, and has built up a strong driving force for the company’s
further development.

Shanghai Baosteel is located in Shanghai, China. Its first phase construction
project began on the 23rd of December in 1978, and was completed and put into
production on the 15th of September in 1985. Its second phase project went into
operation in June 1991, and the third phase project was completed before the end
of 2000. Shanghai Baosteel officially became a stock company on the 3rd of
February in 2000, and was successfully listed on the Shanghai Stock Exchange on
the 12th of December in the same year.

In the early days, when Shanghai Baosteel was being set up in 1978, the sponsors
decided that computer systems should be built to assist management.
They realized that the company should import the most advanced equipment, techniques and
management practices of the time from Japan, and take some factories of Nippon
Steel as models.

In May 1981, at the urging of the minister of the Ministry of
Metallurgy and Manufacturing, Shanghai Baosteel finished the “Feasible
Research of the Synthetic Computer System”, and proposed to build the Shanghai
Baosteel information system with a five-level computer structure by setting up
four area-control computer systems between the L3 systems and the central
management information system.

On the 15th of February 1996, Shanghai Baosteel signed a contract with IBM to import
the advanced IBM 9672 computer system from the US as the area-level
management information system for the hot and cold rolling areas in the phase three
project. This changed the approach of the phase two project, in which the hot rolling
and cold rolling areas each had their own management system. The decision
was a revolution in information system construction at Shanghai Baosteel.
Subsequently, the executives of Shanghai Baosteel decided to build
a comprehensive information system on the IBM 9672 to integrate all the
distributed information systems. They then cancelled the fifth-level management
information system, and the new system was put into production in March 1998,
ensuring the proper production of the 1580 hot rolling mill, the 1420 cold rolling mill,
and the subsequent second steel-making system.

In May 2001, Shanghai Baosteel put forward the new strategic concept of Enterprise
System Innovation (ESI). The ESI system comprised three levels: first, to
rebuild the business processes of Shanghai Baosteel into new, effective
ones; second, to restructure the organization on the basis of the new
business processes; third, to build corresponding information systems to help
realize the new business processes. The main objective of the ESI system is to help
Shanghai Baosteel achieve its business targets, be a strong competitor
among steel enterprises, and prepare for the overall challenges after China
becomes a member of the WTO.

The above ESI decision was a forward-looking one made by the executives of Shanghai
Baosteel to help Baosteel modernize its management and become one
of the Global 500.

Shanghai Baosteel has now successfully completed the third phase of its information
system development. In the first phase project, several process control
systems, self-developed central management system (IBM 4341) with batch pro- 
cessing, and PC networks were set up. In the second phase project, process 
control systems and product control systems, imported technology based man- 
agement information system (IBM 4381) for 2050 hot rolling mill, self-developed 
management information system (IBM RS6000) for 2030 cold rolling mill, iron- 
making regional management information system, steel-making regional man- 
agement information system were built. In the third phase project, better con- 
figured process control systems, production control systems for 1580 hot rolling 
mill, and 1420, 1550 cold rolling mills, enterprise-wide OA and human resource 
management system, and ERP system which included integrated production and 
sales system and equipment management system, were successfully developed. 

After the three phases of project construction, Shanghai Baosteel has formed its
four-level production computer system. In recent years, following the ESI concept, many
auxiliary information systems have been set up as well, such as the integrated equipment
maintenance management system, data warehouse and data mining applications,
the information services system for mills and departments, the e-business platform
BSTEEL.COM online, and Supply Chain Management.

The architecture of Shanghai Baosteel’s information system is illustrated
as follows.




Fig. 1. Information Architecture of Shanghai Baosteel

2 Application of Business Intelligence in Shanghai
Baosteel 

2.1 The Application and Methodology of Business Intelligence in 
Shanghai Baosteel

As one of the most advanced metallurgical manufacturing enterprises in China,
Shanghai Baosteel is now in a period of rapid development. In order to continuously
reduce cost and improve competitiveness in the international and domestic
markets, its executives clearly recognize the importance of the following:

- To speed up logistics turnover and improve the level of product
turnover.

- To stabilize and improve product quality.

- To promote sales and related capabilities in order to expand market share.

- To strengthen the infrastructure of cost and finance.

- To optimize the allocation of enterprise resources so as to best satisfy
the markets’ requirements.

To achieve these objectives, the requirement to build an enterprise data
warehouse system was raised. In line with Shanghai Baosteel’s information
development strategy, the data warehouse system should help Shanghai Baosteel
organize every kind of data required by enterprise analysts and deliver all
needed information to end users. Shanghai Baosteel and Shanghai Baosight
therefore began to evaluate and plan the data warehouse system. The evaluation
assessed the current enterprise infrastructure and operational environments of
Shanghai Baosteel. Given the company’s high level of information development,
it concluded that the data warehouse system could be built, and the first data
warehouse subject area was planned: the technique and quality management data
mart.

Currently Shanghai Baosteel runs the enterprise data warehouse system on two
IBM S85 machines, with the ERP system as the major data source. The data
warehouse system includes ODS data stores and integrated subject data stores
built according to the “Quick Data Warehouse Building” methodology. Much
experience was accumulated with the first quality management data mart, which
holds decision support information about products and their quality
management. The system now comprises the enterprise statistics management data
mart, the technique and quality management data mart, the sales and marketing
management data mart, the production management data mart, the equipment
management data mart, the finance and cost data mart (which includes planning
values, metal balancing, cost analysis, BUPC, and finance analysis), the
production administration information system, the enterprise guidelines
system, and manufacturing mill area analysis (which covers steel-making, hot
rolling, cold rolling, etc.). The system currently holds around 2 TB of data;
the ETL task processes about 3 GB of data every day, and about 1 GB of data is
newly appended. In addition, nearly 1700 static analytical reports are
produced each day, and 1600 kinds of dynamic queries are provided concurrently.

At the same time, through many years of practice and research, Shanghai
Baosight has distilled a set of effective business intelligence solutions for
the manufacturing industry. These solutions are significant for product design,
quality management, and cost management in metallurgical manufacturing
enterprises. Typically, the implementation of business intelligence for a
metallurgical manufacturer consists of the following six phases, which divide
the work logically and provide checkpoints on whether the project is
progressing steadily. The flow chart in Fig. 2 gives an overview of the
development phases of this methodology and its workflow.

Fig. 2. The Methodology of the BI Construction



1. Assessment 

In this phase the users’ current situation and conditions are studied, since
these factors directly affect the data warehouse solution. The aim of the
phase is to analyze the users’ problems and the methods to resolve them. The
initial assessment should identify and clarify the targets and the
investigation needed to refine them. The assessment results in a decision to
start, delay, or cancel the project.

2. Requirements investigation 

In this phase, the project group gathers the high-level requirements for both
operations and information technology (IT), and collects the information
required to meet the departments’ targets. The outcome of this phase is a
report that identifies the business purpose, meanings, information
requirements, and user interfaces. These requirements are also used in other
phases of the project and in the design of the data warehouse. In addition,
the enterprise-level subject data model and data warehouse subjects are
completed in this phase.

3. Design 

For the selected subject, the project group focuses on collecting detailed
information requirements and designing the data platform scheme, which
includes data, process, and application modeling. In this phase, several
methods of collecting and testing information are used, such as data modeling,
process modeling, meetings, and prototype presentations.

The project group evaluates the technology scheme, the business requirements,
and the information requirements. At this point the gap between the existing
IT scheme and the required IT scheme becomes apparent, so an appropriate data
warehouse design and scheme should be adopted.

4. Construction 

This phase includes creating the physical databases, data gathering,
application testing, and code review. The data warehouse manager and the lead
end user should become thoroughly familiar with the system. After successful
testing, the data platform can be put into use.

5. Deployment and maintenance 

In this phase, the data warehouse and BI system are made available to business
users, and user training begins at the same time. After deployment,
maintenance and user feedback should be taken into account.

6. Summary 

In this phase, the whole project is evaluated in three steps. The first is to
sum up the successes and lessons learned. The second is to check whether the
configuration has been realized as expected; if needed, plans should be
changed. The third is to evaluate the influence on and the benefit to the
company.

2.2 Successful Cases of Shanghai Baosteel’s BI Application

Shanghai Baosteel’s BI draws not only on knowledge of data warehousing,
mathematics and statistics, data mining, and knowledge discovery, but also on
professional knowledge of metallurgy, automatic control, management, etc.
These are the main characteristics of Shanghai Baosteel’s BI application.
There have been many successful cases at Shanghai Baosteel to date.

— The Production Administration System Based on Data Warehouse 

As a metallurgical manufacturing enterprise, Shanghai Baosteel requires
rational production and proper administration. To report the latest production
status to senior executives and receive the latest guidance from top
management, managers from all mills and functional departments must take part
in the morning conference, which is presided over by the assistant general
manager or the vice general manager. Before the data warehouse system was
built, the staff concerned had to go to the production administration center
every day, and all conference information was organized in a FoxPro system
with manual input, the data coming mainly from telephone calls and the ERP
system. The new production administration system takes full advantage of the
enterprise data warehouse system. Based on the product information collected
by the data warehouse system, it automatically organizes on the Web the daily
production administration information, material flow charts, quality analysis
results, etc., to support daily production administration and the executives’
routine morning conference. With the online meeting system on the Baosteel
intranet, managers can now even take part in the conference and obtain all
kinds of information from their offices. Since being put into production, the
system has earned a good reputation.

— Integrated Process Control Systems

Quality is the life of the enterprise. To meet fierce competition in the
market, continuous improvement of quality control is needed. The IPC systems
have improved quality during production at the lowest cost, and form the core
capability, the know-how, of Shanghai Baosteel’s QA management.

As a supporting analysis system, the IPC system assists quality managers in
controlling production processes, improves technical staff’s statistical and
analytical abilities, and provides operators with more accurate, convenient,
and systematic ways to inspect products. These systems integrate highly
visualized functions with multi-layered data mining functions.

— The Quality Data Mining System

The quality data mart was the first BI system to bring benefits to Shanghai
Baosteel, and it plays an increasingly important role in daily management. On
the one hand, it provides daily reports and analysis functions such as online
quality analysis, capability change analysis, quality exception analysis,
finished product quality analysis, quality cost statistics, index data
maintenance, integrated analysis, KIV-KOV modeling, and so on. On the other
hand, the data mart provides good support for quality data mining.

Fig. 3. The Production Administration System of Shanghai Baosteel.

Fig. 4. The Web Page of the Storage Presentation.

A quality data mining system based on the data warehouse is a strong aid to
the metallurgy industry. There are many cases of data mining and knowledge
discovery, such as reducing the sampling of steel ST12, improving the bend
strength of hot-dip galvanized products of steel ST06Z, and knowledge-based
material calculation design. In the case of steel ST12, the original
specification required sampling at both the head and the tail, which consumed
a great deal of manpower and equipment. Analysis of key indexes such as bend
strength and tensile strength showed that head sampling and tail sampling
gave similar results, with the tail samples slightly worse. Following expert
review and testing in practice, Shanghai Baosteel released a new operational
specification after April 2004 that tests only the tail sample for steel
ST12. As a result, costs are reduced by RMB 2.60 million annually.
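
The head-versus-tail comparison can be sketched as a paired summary of one
quality index. This is only an illustration under stated assumptions: the
readings are invented, `compare_head_tail` is a hypothetical helper, and the
text does not say which statistical procedure was actually used.

```python
from statistics import mean, stdev


def compare_head_tail(head, tail):
    """Summarise paired head/tail measurements of one quality index."""
    if len(head) != len(tail) or len(head) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [h - t for h, t in zip(head, tail)]
    return {
        "mean_head": mean(head),
        "mean_tail": mean(tail),
        "mean_diff": mean(diffs),  # > 0 means the tail reads lower than the head
        "sd_diff": stdev(diffs),   # spread of the paired differences
    }


# Hypothetical tensile strength readings (MPa) for five coils; not Baosteel data.
head = [305.0, 310.0, 298.0, 302.0, 307.0]
tail = [303.0, 309.0, 296.0, 301.0, 304.0]

summary = compare_head_tail(head, tail)
# If the tail is on average no better than the head, testing only the tail
# is the more conservative choice.
tail_is_conservative = summary["mean_diff"] >= 0
print(summary, tail_is_conservative)
```

A small, consistently non-negative mean difference would correspond to the
situation described in the text: the two sampling positions agree closely, and
the tail sample is the slightly worse of the two.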

— The Iron Ore Mixing System 

Raw material mixing is one of the most important jobs at the beginning of
steel-making. Shanghai Baosteel once faced many questions. How should a new
ore that is not listed in the original ore mixing scheme be evaluated? Which
sintering ore most affects the final quality? Is there one scheme that fits
all the different needs? Can the quality of the sinter be improved while its
cost is reduced?

Data mining in the Iron Ore Mixing System aims to find ways to meet the needs
of all kinds of sintering ores. The system forecasts sinter quality through
modeling, supports low-cost mixing methods, builds an iron ore mixing
knowledge base, and provides a friendly user interface. The data mining of
iron ore mixing proceeds in four steps: data preparation, iron ore evaluation
with clustering analysis, modeling with neural networks, and optimization. The
evaluation results from the system are almost the same as those from experts.
The forecasting accuracy is above 85%.
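
The clustering step of this procedure can be illustrated with a small sketch.
The text says only that "clustering analysis" is used, so plain k-means is
assumed here as a stand-in; the ore names and the two chemical features are
invented for the example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points (deterministic)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical ores described by (Fe %, SiO2 %); not real assay data.
ores = {
    "ore_A": (65.0, 2.0),
    "ore_B": (58.0, 6.0),
    "ore_C": (64.5, 2.4),
    "ore_D": (57.5, 6.5),
}
labels, centroids = kmeans(list(ores.values()), k=2)
groups = dict(zip(ores, labels))
print(groups)
```

In such a scheme, a new ore that is absent from the original mixing scheme
could be evaluated by looking at which group of known ores it falls into.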

— A Defect Diagnosis Expert System 

Defect diagnosis is an important basis of reliability engineering, and an
important component and key technology of total quality control. Computer-aided
defect diagnosis can reduce and prevent the repeated occurrence of the same
defects. It can also help provide information for decision-making.

The system is built on experiments and the large volume of data recorded by
technicians after real accidents. It was developed with computer technologies,
statistical analysis, data mining technologies, and artificial intelligence,
and consists of data storage, statistical analysis, a knowledge repository,
and defect diagnosis. The system contains both highly generalized visualized
functions and multi-layered data mining functions.




Fig. 5. The Quality Control of IPC.

Fig. 6. Improvement of the Bend Strength of the Steel ST06Z Products.

3 The Next Generation of Business Intelligence in
Shanghai Baosteel



Shanghai Baosteel is the main body of Shanghai Baosteel Group. With Shanghai
Baosteel Group listed in the Fortune Global 500 for 2003, the application of BI
in Shanghai Baosteel will be strengthened and developed further. The tasks to
be performed are as follows.



3.1 Carrying out the Application at Department Level

Shanghai Baosteel will continue to develop its own distinctive BI, will take
quality control and synthetic reports as its main goals, and will extend the
combination of IPC, data warehouse, and data mining. Quality control is an
everlasting subject in manufacturing and an enduring area in which product
design and development should be strengthened. Enterprises today place
particular strategic emphasis on process improvement in response to both
steadily rising client requirements and a fiercely competitive market. In
industrial manufacturing, especially metallurgical manufacturing, many factors
cause quality problems, such as equipment failure, staff carelessness,
abnormal parameters, raw material differences, and fluctuating settings.
Especially in large steel enterprises with complicated business and technical
flows, “timely finding and forecasting of exceptions, prompt control, and
quality analysis” is a necessity.

Therefore, based on the Six Sigma notion of quality control, applications
built on data warehouse technologies, together with process control, fuzzy
control, neural networks, expert systems, and data mining, can be applied to
complicated working procedures such as the blast furnace, iron-making,
steel-making, continuous casting, and steel rolling. This is certainly the
road to developing BI further at department level.
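
One common statistical device for finding exceptions in a process parameter is
a control chart with limits at the baseline mean plus or minus three standard
deviations. The sketch below assumes that technique purely for illustration;
the text does not state which method Shanghai Baosteel applies, and the
function names and readings are hypothetical.

```python
from statistics import mean, stdev


def control_limits(baseline, n_sigma=3.0):
    """Return (lower, upper) limits from in-control baseline readings."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - n_sigma * spread, centre + n_sigma * spread


def find_exceptions(values, limits):
    """Return the indexes of readings that fall outside the limits."""
    low, high = limits
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical readings of one process parameter; not plant data.
baseline = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0]
limits = control_limits(baseline)

new_values = [10.1, 9.7, 11.5, 10.0]
exceptions = find_exceptions(new_values, limits)
print(limits, exceptions)
```

Here only the third new reading lies outside the limits, so it is the one that
would be flagged for prompt control and quality analysis.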




Fig. 7. The Forecast of the RDI in the Iron Ore.

Fig. 8. The Diagnosis Expert System of the Steel Tube.

3.2 Strengthening Research on the Application of BI
at Enterprise Level

At enterprise level there are many requirements, which can lead to a data
warehouse based EIS, a Key Performance Index (KPI) system, etc.

A KPI is a measurable management target used to set, sample, calculate, and
analyze the key parameters of the inputs and outputs of internal
organizational flows. It is a tool for decomposing the organization’s
strategic goals into operational tasks, and is the basis of organizational
performance management. It assigns definite responsibilities to a department
manager, and these extend to the staff in the department. Building a
well-defined, credible KPI system is therefore the key to good performance
management.
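
The decomposition of a strategic goal into operational KPIs can be pictured as
a weighted tree. The sketch below is a hypothetical data structure, not a
description of any KPI system at Shanghai Baosteel; the class name, the weights,
and the example indicators are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Kpi:
    name: str
    weight: float = 1.0
    target: float = 0.0
    actual: float = 0.0
    children: list = field(default_factory=list)

    def achievement(self):
        """Leaf: actual / target (needs a non-zero target).
        Parent: weighted mean of the children's achievement."""
        if not self.children:
            return self.actual / self.target
        total_weight = sum(child.weight for child in self.children)
        weighted = sum(child.weight * child.achievement() for child in self.children)
        return weighted / total_weight


# Hypothetical strategic goal split into two operational indicators.
goal = Kpi(
    "Customer satisfaction",
    children=[
        Kpi("On-time delivery rate", weight=0.5, target=100.0, actual=90.0),
        Kpi("Qualified product rate", weight=0.5, target=80.0, actual=80.0),
    ],
)
score = goal.achievement()
print(round(score, 2))
```

Each leaf can be assigned to a responsible manager, while the roll-up at the
root gives a single figure for the strategic goal.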

3.3 Following up the Technical Tide of BI
and Applying New Technologies into Industry

BI is a subject that overlaps many disciplines. Shanghai Baosteel and Shanghai
Baosight are actively following the technical tide of BI and researching new BI
techniques for metallurgical manufacturing in fields such as stream data
management, text (message) data mining, KPI practice in manufacturing,
position-based customized information, knowledge management, etc.

Stream data management: Data from the L3 system (production control system)
has the characteristics of stream data, so stream data management techniques
can be applied when IPC systems need to perform timely analysis and data
mining on the production line.
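
A basic building block of stream data management is a sliding window, which
keeps only the most recent readings so that analysis can keep pace with the
production line. The following sketch assumes a simple moving average over a
fixed-size window; the function name and the readings are hypothetical and are
not taken from the L3 system.

```python
from collections import deque


def sliding_mean(stream, window):
    """Yield the mean of the last `window` readings as each one arrives."""
    recent = deque(maxlen=window)
    total = 0.0
    for value in stream:
        if len(recent) == window:
            total -= recent[0]  # the oldest reading is about to be evicted
        recent.append(value)
        total += value
        yield total / len(recent)


# Hypothetical temperature readings arriving from a production line.
readings = [100.0, 102.0, 98.0, 120.0, 121.0]
means = list(sliding_mean(readings, window=3))
print(means)
```

Because the generator keeps only `window` values in memory, it can run over an
unbounded stream, and a sudden rise in the moving average is one simple signal
that might trigger further analysis.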

Text (message) data mining: Data communication between the ERP system and the
other information systems of Shanghai Baosteel is implemented by messages. All
the messages have been extracted and loaded into the data warehouse system.
How to use text mining techniques to analyze and resolve exceptions quickly
will be a new challenge.

The practice of KPI in manufacturing, position-based customized information,
and knowledge management are new subjects and trends that will broaden BI
application in metallurgical manufacturing.

Meanwhile, Shanghai Baosight and the Technology Center of Shanghai Baosteel
are taking full advantage of their previous experience to develop data mining
tools with independent intellectual property rights. Practical Miner from the
Technology Center has been popular in Shanghai Baosteel for years, while
Shanghai Baosight is developing a data mining tool that follows the CWM 1.1
and CORBA standards and is expected to be released early in 2005.

4 Conclusions 

Shanghai Baosteel is a leader in the Chinese metallurgical manufacturing
industry and in BI application as well. After many years of application and
practice, it has benefited greatly from BI, and it will pursue even more
ambitious BI goals in the near future.

Abstract. Much research has been devoted over the years to investigating and
advancing the techniques and tools used by analysts when they model. The aim
of this research was to determine whether practitioners still take conceptual
modelling seriously in practice, as opposed to what academics, software
providers and their resellers promote as what should be happening. In
addition, what are the
most popular techniques and tools used for conceptual modelling? What are the 
major purposes for which conceptual modelling is used? The study found that 
the top six most frequently used modelling techniques and methods were ER 
diagramming, data flow diagramming, systems flowcharting, workflow model- 
ling, RAD, and UML. However, the primary contribution of this study was the 
identification of the factors that uniquely influence the continued-use decision 
of analysts, viz., communication (using diagrams) to/from stakeholders, internal 
knowledge (lack of) of techniques, user expectations management, understand- 
ing models integration into the business, and tool/software deficiencies. 



1 Introduction 

The areas of business systems analysis, requirements analysis, and conceptual model- 
ling are well-established research directions in academic circles. Comprehensive 
analytical work has been conducted on topics such as data modelling, process
modelling, meta modelling, model quality, and the like. A range of frameworks
and categorisations of modelling techniques have been proposed (e.g. [6, 9]). However, they
mostly lack an empirical foundation. Thus, it is difficult to provide solid statements 
on the importance and potential impact of related research on the actual practice of 
conceptual modelling. 

More recently, Wand and Weber [13, p. 364] assume “the importance of
conceptual modelling” and state “Practitioners report that conceptual
modelling is difficult and that it often falls into disuse within their
organizations.” Unfortunately, anecdotal feedback to us from information
systems (IS) practitioners largely confirmed the assertion of Wand and Weber
[13]. Accordingly, as researchers involved in attempting to advance the theory
of conceptual modelling in organisations, we were concerned to determine
whether practitioners still found conceptual modelling useful and whether they
were indeed still performing conceptual modelling as part of their business
systems analysis processes. Moreover, if practitioners still found modelling
useful, why did they find it useful, and what were the major factors that
inhibited the wider use of modelling in their projects? In this way, the
research that we were performing would be relevant for the practice of
information systems development (see the IS Relevance debate on ISWorld,
February 2001).

Hence, the research in this paper is motivated in several ways. First, we want to ob- 
tain empirical data that conceptual modelling is indeed being performed in IS practice 
in Australia. Such data will give overall assurance to the practical relevance of the 
research that we perform in conceptual modelling. Second, we want to find out what 
are the principal tools, techniques, and purposes for which conceptual modelling is 
performed currently in Australia. In this way, researchers can obtain valuable infor- 
mation to help them direct their research towards aspects of conceptual modelling that 
contribute most to practice. Finally, we were motivated to perform this study so that 
we could gather and analyse data on major problems and benefits unique to the task of 
conceptual modelling in practice. 

So, this research aims to provide current insights into actual modelling practice. 
The underlying research question is “Do practitioners actually use conceptual model- 
ling in practice?” The derived and more detailed questions are: 

What are popular tools and techniques used for conceptual modelling in Australia? 

What are the purposes of modelling? 

What are major problems and benefits unique to modelling? 

In order to provide answers for these questions, an empirical study using a web- 
based questionnaire has been designed. The goal was to determine what modelling 
practices are being used in business, as opposed to what academics, software provid- 
ers and their resellers believe should be used. In summary, we found that the current 
state of usage of business systems/conceptual modelling in Australia is: ER diagram- 
ming, data flow diagramming, systems flowcharting, and workflow modelling being 
most frequently used for database design and management, software development, 
documenting and improving business processes. Moreover, this modelling work is 
supported in most cases by the use of Visio (in some version) as an automated tool. 
Furthermore, planned use of modelling techniques and tools into the short-term future 
appears to be expected to reduce significantly compared to current usage levels. 

The remainder of the paper unfolds in the following manner. The next section re- 
views the related work in terms of empirical data in relation to conceptual modelling 
practice. The third section explains briefly the instrument and methodology used. 
Then, an overview of the quantitative results of the survey is given. The fifth section 
presents succinctly the results of the analysis of the textual data on the problems and 
benefits of modelling. The last section concludes and gives an indication of further 
work planned. 

2 Related Work 

Over the years, much work has been done on how to do modelling - the quality, cor- 
rectness, completeness, goodness of representation, understandability, differences 
between novice and expert modellers, and many other aspects (e.g. [7]).
Comparatively little empirical work, however, has been undertaken on modelling in practice.
Floyd [3] and Necco et al. [8] conducted comprehensive empirical work into the use 
of modelling techniques in practice but that work is now considerably dated. Batra 
and Marakas [1] attempted to address this lack of current empirical evidence;
however, their work focused on comparing the perspectives of the academic and
practitioner communities regarding the applications of conceptual data modelling. 
Indeed, these authors simply reviewed the academic and practitioner literatures with- 
out actually collecting primary data on the issue. Moreover, their work is now dated. 
However, it is interesting that they (p. 189) observe “there is a general lack of any 
substantive evidence, anecdotal or empirical, to suggest that the concepts are being 
widely used in the applied design environment.” Batra and Marakas [1, p. 190] state 
that “Researchers have not attempted to conduct case or field studies to gauge the 
cost-benefits of enterprise-wide conceptual data modelling (CDM).” This research has 
attempted to address the problems alluded to by Batra and Marakas [1]. 

Iivari [4] provided some data on these questions in a Finnish study of the percep- 
tions of effectiveness of CASE tools. However, he found the adoption rate of CASE 
tools by developers in organisations very low (and presumably the extent of concep- 
tual modelling to be low as well). More recently, Persson and Stirna [10] noted the 
problem, however, their work was limited in that it was only an exploratory study into 
practice. Most recently, Chang et al. [2] conducted 11 interviews with experienced
consultants in order to explore the perceived advantages and disadvantages of busi- 
ness process modelling. This descriptive study did not, however, investigate the criti- 
cal success factors of process modelling. Sedera et al. [11] have conducted three case 
studies to determine a process modeling success model, however they have not yet 
reported on a planned empirical study to test this model. Furthermore, the studies by 
Chang et al. [2] and Sedera et al. [11] are limited to the area of process modeling. 



3 Methodology 

This study was conducted in the form of a web-based survey issued with the assis- 
tance of the Australian Computer Society (ACS) to its members. The survey consisted 
of seven pages 1 . The first page explained the objectives of our study. It also high- 
lighted the available incentive, i.e., free participation in one of five workshops on 
business process modelling. The second page asked for the purpose of the modelling 
activities. In total, 17 purposes (e.g., database design and management,
software development) were made available. The respondents were asked to
evaluate the relevance of each of these purposes using a five-point Likert
scale ranging from 1 (not
relevant) to 5 (highly relevant). The third page asked for the modelling techniques 2 
used by the respondent. It provided a list of 18 different modelling techniques ranging 
from data flow diagram and ER diagrams, to the various IDEF standards, up to UML. 
For each modelling technique, the participants had to provide information about the 
past, current and future use of the modelling technique. It was possible to differentiate 
between infrequent and frequent use. Furthermore, participants could indicate whether 
they knew the technique or did not use it at all. It was possible also to add further 
modelling techniques that they used. The fourth page was related to the modelling 
tools. Following the same structure as for the modelling technique, a list of 24 model- 
ling tools was provided. A hyperlink provided a reference to the homepage of each 
tool provided. It was clarified also if a tool had been known under a different name
(e.g. Designer2000 for the Oracle9i Developer Suite). The fifth page explored
qualitative issues. Participants were asked to list major problems and issues
they had experienced with modelling as well as perceived key success factors.
On the sixth page, demographic data was collected. This data included person
type (practitioner, academic or student), years of experience in business
systems analysis and modelling, working area (business or IT), training in
modelling and the size of the organisation. The seventh page allowed contact
details for the summarised results of the study and the free workshop to be
entered. The instrument was piloted with 25 members of two research centres as
well as with a selected group of practitioners. Minor changes were made based
on the experiences within this pilot.

1 A copy of the survey pages is available from the authors on request.

2 ‘Technique’ here is used as an umbrella term referring to the constructs of the technique,
their rules of construction, and the heuristics and guidelines for refinement.

A major contribution of this paper is an examination of the data gathered through 
the fifth page of the survey. This section of the survey asked respondents to list criti- 
cal success factors for them in the use of conceptual modelling and problems or issues 
they encountered in successfully undertaking modelling in their organisations. The 
phenomena that responses to these questions allowed us to investigate were why do 
we continue/discontinue to use a technical method (implemented using a technologi- 
cal tool) - conceptual modelling. To analyse these phenomena, we used the following 
procedure: 

1. What responses confirm the factors we already know about in regard to these
phenomena; and

2. What responses are identifying new factors that are unique to the domain of con- 
ceptual modelling? 

To achieve step 1, we performed a review of the current thinking and literature in 
the areas of adoption and continued use of a technology. Then, using Nvivo 2, one 
researcher classified the textual comments, where relevant, according to these known 
factors. This researcher’s classification was then reviewed and confirmed with a sec- 
ond researcher. The factors identified from the literature and used in this first phase of 
the process are summarised and defined in Table 1. 

After step 1, there remained factors that did not readily fit into one or other of the
known factor categories. These unclassified responses had the potential to provide us 
with insight on factors unique and important to the domain of conceptual modelling. 
However, the question was how to derive this information in a relatively objective 
and unbiased manner from the textual data. We used a new state-of-the-art textual 
content analysis tool called Leximancer 3 . Using this tool, we identified from the un- 
classified text five new factors specific to conceptual modelling. Subsequently, one 
researcher again classified the remaining responses using these newly identified fac- 
tors. His classification was reviewed and confirmed by a second researcher. Finally, 
the relative importance of each of the new factors was determined. 

3.1 Why Use Leximancer? 

The Leximancer system allows its users to analyse large amounts of text quickly. The 
tool performs this analysis both systematically and graphically by creating a map of 
the constructs - the document map - that are displayed in such a manner that links to 
related subtext may be subsequently explored. Each of the words on the document 
map represents a concept that was identified. The concept is placed on the map in 



3 For more information on Leximancer, see www.leximancer.com 

Table 1. Summary of factors identified for initial analysis

Factor | Definition | Source(s)
Relative Advantage | The degree to which adopting/using the technique is perceived as being better than using the practice it supersedes. | Karahanna et al. [5]
Image | The degree to which adoption/usage of the technique is perceived to enhance one's image or status. | Karahanna et al. [5]
Compatibility | The degree to which adopting the technique is compatible with the individual's job responsibilities and value system. | Tan and Teo [12]
Complexity | The degree to which using a particular technique is free from effort. | Karahanna et al. [5]
Trialability | The degree to which one can experiment with the technique on a limited basis before making an adoption or rejection decision. | Karahanna et al. [5]
Risk | The degree of perceived risk that accompanies the adoption of the technique. | Tan and Teo [12]
Visibility | The degree to which the technique is visible within the organisation. | Karahanna et al. [5]
Results Demonstrability | The degree to which results of adopting/using the technique are observable and communicable to others. | Karahanna et al. [5]
Subjective Norms | Generated by the normative beliefs that a respondent attributes to what relevant others (colleagues/peers/respected management) expect them to do with respect to adopting the technique as well as their motivation to comply with those beliefs. | Karahanna et al. [5]
Self Efficacy | Self-confidence in a participant’s own ability to perform a behaviour. | Tan and Teo [12]
Facilitating Conditions | Availability of, and ease of access to, technological infrastructure and support. | Tan and Teo [12]
Internalisations | Degree to which decisions are motivated by accepting information from expert sources and integrating it into one's cognitive system. | Karahanna et al. [5]
Identification | Decisions resulting from feeling some bond with a likeable source. | Karahanna et al. [5]
Compliance | Degree of influence that is produced by a powerful source having control over the respondent in the forms of rewards and punishments. | Karahanna et al. [5]
Top management support | Degree of support for the project from middle and upper management of the organisation. |
Communication Issues | Degree to which the decisions or attitudes were affected by communications problems between the respondents and key stakeholders within the organisation. |


proximity of other concepts in the map through a derived combination of the direct 
and indirect relationships between those concepts. Essentially, the Leximancer system 
is a machine-learning technique based on the Bayesian approach to prediction. The 
procedure used for this is a self-ordering optimisation technique and does not use 
neural networks. Once the optimal weighted set of words is found for each concept, it
is used to predict the concepts present in fragments of related text. In other words,
each concept has other concepts that it attracts (or is highly associated with contextually)
as well as concepts that it repels (or is highly disassociated with contextually).
The relationships are measured by the weighted sum of the number of times two concepts
are found in the same ‘chunk’. An algorithm is used to weight them and determine
the confidence and relevancy of the terms to others in a specific chunk and
across chunks.
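
The co-occurrence idea described above can be illustrated with a small sketch. It is not Leximancer's algorithm: it only counts, without any weighting, how often two concepts fall in the same chunk, and the chunks, the concept list, and the function name `cooccurrence_counts` are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(chunks, concepts):
    """Count how often each pair of concepts appears in the same chunk.

    This is an unweighted simplification; the tool described in the text
    additionally weights the counts.
    """
    counts = Counter()
    for chunk in chunks:
        words = set(chunk.lower().split())
        # Concepts present in this chunk, sorted so each pair has one canonical order.
        present = sorted(c for c in concepts if c in words)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Hypothetical chunks and concepts, for illustration only.
chunks = [
    "users lack knowledge of modelling techniques",
    "software tool issues slow the model work",
    "lack of knowledge about the software",
]
concepts = ["knowledge", "lack", "software", "techniques"]
counts = cooccurrence_counts(chunks, concepts)
```

Pairs with higher counts would sit closer together on a map of this kind; pairs that never share a chunk have a count of zero.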

Leximancer was selected for this qualitative data analysis for several reasons: 

• Its ability to derive the main concepts within text and their relative importance 
using a scientific, objective algorithm; 

• Its ability to identify the strengths between concepts (how often they co-occur) - 
centrality of concepts; 

• Its ability to assist the researcher in applying grounded theory analysis to a textual 
dataset; 

• Its ability to assist in visually exploring textual information for related themes to 
create new ideas or theories; and 

• Its ability to assist in identifying similarities in the context in which the concepts 
occur - contextual similarity. 

4 Survey Results and Discussion 

From 674 individuals who started to fill out the survey, 370 actually completed the 
entire survey, which leads to a completion rate of 54.8%. Moreover, of the 12,000 
members of the ACS, 1,567 indicated in their most recent membership profiles that 
they were interested in conceptual modelling/business systems analysis. Accordingly, 
our 370 responses indicate a relevant response rate of 23.6%, which is very acceptable 
for a survey. Moreover, we offered participation in one of five seminars on business 
process modelling free of charge as an inducement for members to participate. This 
offer was accepted by 186 of 370 respondents. Corresponding with the nature of the 
ACS as a professional organisation, 87% of the participants were practitioners. The 
remaining respondents were academics (6%) and students (7%). It is also not a sur- 
prise that 85% of the participants characterised themselves as an IT service person 
while only 15% referred to themselves as a businessperson or end user. 
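
As a quick check of the arithmetic, the two rates can be recomputed from the counts given above. The helper name `rate` is an assumption of this sketch. Rounding to one decimal place gives 54.9 for the completion rate; the figure of 54.8% quoted above corresponds to truncating rather than rounding.

```python
def rate(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


started = 674      # individuals who began the survey
completed = 370    # individuals who completed it
interested = 1567  # ACS members interested in conceptual modelling

completion_rate = rate(completed, started)    # share of starters who finished
response_rate = rate(completed, interested)   # the "relevant response rate"
```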

Sixty-eight percent of the respondents indicated that they gained their knowledge
of Business Systems Analysis at university. Other sources were TAFE (Technical
and Further Education) (6%) and the ACS (3%). Twenty-three percent indicated that they
did not have any formal training in Business Systems Analysis. Forty percent of the
respondents indicated that they have less than five years' experience with modelling.
Thirty-eight percent have between 5 and 15 years of experience. A significant proportion,
22%, has more than 15 years of experience with modelling. These figures indicate
that the average expertise of the respondents is presumably quite high. Twenty-eight
percent of respondents indicated that they worked in firms employing fewer than
50 people, most likely small software consulting firms. However, a quarter of the 
respondents worked in organisations with 1000 or less employees. So, by Australian 
standards, they would be involved in software projects of reasonable size. 

We were concerned to obtain information in three principal areas of conceptual
modelling in Australia, viz., what techniques are currently used in practice, what tools
are used for modelling in practice, and what are the purposes for which conceptual
modelling is used.

Table 2 presents from the data the top six most frequently used modelling tech- 
niques. It describes the usage of techniques as not known or not used, infrequently 
used (which in the survey instrument was defined as used less than five times per 
week), and frequently used. The table clearly demonstrates that the top six most fre- 
quently used (used 5 or more times a week) techniques are ER diagramming, data 
flow diagramming, systems flowcharting, workflow modelling (range of workflow 
modelling techniques), RAD, and UML. It is significant to note that even though
object-oriented analysis, design, and programming has been the predominant paradigm
for systems development over the last decade, 64 percent of respondents either
did not know or did not use UML. While not every conceptual modelling technique
available was named in the survey, the eighteen techniques used were selected based
on their popularity reported in prior literature. It is also interesting to note that
approximately 40 percent of respondents (at least) either do not know or do not use any of the
18 techniques named in the survey.



Table 2. Top six modelling techniques most frequently used

Description | Not Known/Not Used | % | Infrequently Used | % | Frequently Used | %
--- | --- | --- | --- | --- | --- | ---
ER diagram | 154 | 42% | 70 | 19% | 146 | 39%
Data flow diagram | 152 | 41% | 91 | 25% | 127 | 34%
System flowcharts | 153 | 43% | 94 | 26% | 112 | 31%
Workflow modelling | 187 | 52% | 88 | 24% | 86 | 24%
RAD (rapid application development) | 227 | 63% | 55 | 15% | 79 | 22%
UML (unified modelling language) | 232 | 64% | 60 | 16% | 72 | 20%



Moreover, while not explicitly reported in Table 2, this current situation of non- 
usage appears to be set to increase into the short-term future (next 12 months) as the 
planned frequent use of the top four techniques is expected to drop to less than half its 
current usage, viz., ER diagramming (17 percent), data flow diagramming (15 per- 
cent), systems flowcharting (10 percent), and workflow modelling (12 percent). Fur- 
thermore, no increase in the intention to use any of the other techniques was reported, 
to balance this out. Perhaps, this short-term trend reflects the perception that the cur- 
rent general downturn in the IT industry will persist into the future. Accordingly, 
respondents perceive a significant reduction of new developmental work requiring 
business systems modelling in the short-term future. It may also just reflect the lack of 
planning of future modelling activities. 

Our work was also interested in what tools were used to perform the conceptual 
modelling work that was currently being undertaken. Table 3 presents the top six 
most frequently used tools when performing business systems analysis and design. 
The data is reported using the same legend as that used for Table 2. 

Again, while not every conceptual modelling tool available was named in the survey,
the twenty-four tools were selected based on their popularity reported in prior
literature. Table 3 clearly indicates that Visio (58 percent, counting both infrequent and
frequent use) is currently the preferred tool for business systems modelling.
This result is not surprising as the top four most frequently used techniques are well
supported by Visio (in its various versions). A long way second in frequent use is
Rational Rose (19 percent, counting both infrequent and frequent use), reflecting the current
level of use of object-oriented analysis and design techniques. Again, at least 40 percent
of respondents (approximately) either do not know or do not use any of the 24 tools
named in the survey, not even a relatively simple tool like Flowcharter or Visio.



Table 3. Top six most frequently used tools

Description | Not Known/Not Used | % | Infrequently Used | % | Frequently Used | %
--- | --- | --- | --- | --- | --- | ---
Visio | 150 | 42% | 57 | 16% | 148 | 42%
Rational Rose | 285 | 81% | 33 | 9% | 36 | 10%
Oracle9i Developer Suite | 302 | 85% | 31 | 9% | 21 | 6%
iGrafx Flowcharter | 284 | 80% | 49 | 14% | 22 | 6%
AllFusion ERwin Data Modeler | 333 | 94% | 12 | 3% | 10 | 3%
WorkFlow Modeler | 346 | 97% | 2 | 1% | 7 | 2%



Moreover, while not explicitly reported in Table 3, into the short-term future (next 
12 months), the planned frequent use of the top two tools is expected to drop signifi- 
cantly from their current usage levels, viz., Visio (21 percent) and Rational Rose (8 
percent) with no real increase reported for planned use of other tools to compensate 
for this drop. Again, this trend in planned tool usage appears to reflect the fact that 
respondents expect a significant reduction in new developmental work requiring busi- 
ness systems modelling in the short-term future. 

Business systems modelling (conceptual modelling) must be performed for some 
purpose. Accordingly, we were interested in obtaining data on the various purposes 
for which people might be undertaking modelling. Using a five-point Likert scale 
(where 5 indicates Very Frequent Use), Table 4 presents (in rank order from the high- 
est to the lowest score) the average score for purpose of use from the respondents. 



Table 4. Average use score for modelling purpose (in rank order)

Description | Average Score | Standard Deviation
--- | --- | ---
Database design and management | 3.9 | 1.2
Improvement of internal business processes | 3.7 | 1.2
Software development | 3.7 | 1.2
Business process documentation | 3.7 | 1.2
Workflow management | 3.4 | 1.2
Improvement of collaborative business processes | 3.4 | 1.3
Design of Enterprise Architecture | 3.4 | 1.3
Change management | 3.3 | 1.3
Knowledge management | 3.2 | 1.3
End user training | 3.1 | 1.3
Software configuration | 3.1 | 1.3
Software selection | 2.9 | 1.3
Certification / quality management | 2.8 | 1.3
Activity-based costing | 2.6 | 1.4
Human resource management | 2.6 | 1.3
Auditing | 2.5 | 1.3
Simulation | 2.5 | 1.3

Table 4 indicates that database design and management remains the highest average
purpose for use of modelling techniques. This fact links to the earlier result of ER
diagramming being the most frequently used modelling technique. Moreover, soft- 
ware development as a purpose would support the high usage of data flow diagram- 
ming and ER diagramming noted earlier. Indeed, the relatively highly regarded pur- 
poses of documenting and improving business processes, and managing workflows, 
would support further the relatively high usage of workflow modelling and flowchart- 
ing indicated earlier. The more specialised tasks like identifying activities for activity- 
based costing and internal control purposes in auditing appear to be relatively infre- 
quently used purposes for modelling. This fact however may derive from the type of 
population that was used for the survey, viz., members of the Australian Computer 
Society. 



5 Textual Analysis Results and Discussion 

Nine hundred and eighty (980) individual comments were received across the questions
on critical success factors and problems/issues for modelling. Table 5
shows the classification of these 980 comments after phase 1 of the analysis, which used
Nvivo and the known factors (Table 1) influencing continued use of new technologies
in firms.



Table 5. Results of classification by key factors influencing continued use (after phase 1)

Key | Percentage | Totals
--- | --- | ---
Relative Advantage/Usefulness | 45% | 441
Complexity | 8% | 74
Compatibility | 7% | 69
Internalisations | 6% | 54
Top Management Support | 5% | 48
Facilitating Conditions | 4% | 42
Image | 0% | 0
Trialability | 0% | 4
Risk | 1% | 11
Visibility | 0% | 2
Results Demonstrability | 1% | 5
Subjective Norms | 2% | 22
Self-Efficacy | 1% | 14
Identification | 0% | 2
Compliance | 0% | 2
Communication Issues | 3% | 25
Unclassified | 17% | 165
Total (All records) | 100% | 980



Clearly, relative advantage (disadvantage)/usefulness from the perspective of the
analyst was the major driving factor influencing the decision to continue (discontinue)
modelling. Does conceptual modelling (and/or its supporting technology) take too
much time, make my job easier or harder, or make it easier/harder for me
to elicit/confirm requirements with users? Comments of this kind typically contributed to
this factor. Furthermore, it is not surprising to see that complexity of the method
and/or tool, compatibility of the method and/or tool with the responsibilities of the
analyst's job, the views of “experts”, and top management support were other major factors
driving analysts’ decisions on continued use. Prior literature had told us to expect
these results, in particular, the key importance of top management support to the continued
successful use of such key business planning and quality assurance mechanisms
as conceptual modelling for systems.

However, nearly one-fifth of the comments remained unclassified. Were there any 
new, important factors unique to the conceptual modelling domain contained in this 
data? Fig. 1 shows a document (concept) map produced by Leximancer from the 
unclassified comments. 



Fig. 1. Concept map produced by Leximancer on the unclassified comments



Five factors were identified from this map using the centrality of concepts and the
relatedness of concepts to each other within identifiable ‘chunks’. While the resolution
of the Leximancer-generated concept map (Fig. 1) may be difficult to read on its
own here, the concepts (terms) depicted are referred to within the discussion of the
relevant factors below.

A. Internal Knowledge (Lack of) of Techniques 

This group centred on such concepts as knowledge, techniques, information, large, 
easily and lack. Related concepts were work, systems, afraid, UML and leading. Ac- 
cordingly, we used these concepts to identify this factor as the degree of di- 
rect/indirect knowledge (or lack of) in relation to the use of effective modelling tech- 
niques. Highlighted inadequacies raise issues of the modeller’s skill level and 
questions of insufficient training. 

B. User Expectations Management 

This group centred on such concepts as expectations, stakeholders, audience and re- 
view. Understanding, involved, logic and find were related concepts. Consequently, 
we used these items to identify this factor as issues arising from the need to manage 
the expectations of users as to what they expect conceptual modelling to do for them 
and to produce. In other words, the analyst must ensure that the stakeholders/audience 
for the outputs of conceptual modelling have a realistic understanding of what will be 
achieved. Continued (discontinued) use of conceptual modelling may be influenced 
by difficulties experienced (or expected) with users over such issues as acceptance, 
understanding and communication of the outcomes of the modelling techniques. 

C. Understanding the Models Integration into the Business 

This group centred on understanding, enterprise, high, details, architecture, logic, 
physical, implementation and prior. Accordingly, we identified a factor as the degree 
to which decisions are affected by stakeholder/modeller’s perceived understanding (or 
lack of) in relation to the models integration into business processes (initial and ongo- 
ing). In other words, for the user, to what extent do the current outputs of the model- 
ling process integrate with the existing business processes and physical implementa- 
tions to support the goals of the overall enterprise architecture? 

D. Tool/Software Deficiencies 

This group was focused on such concepts as software, issues, activities, and model. 
Subsequently, a factor was identified as the degree to which decisions are affected by 
issues relating directly to the perceived lack of capability of the software and/or the 
tool design. 

E. Communication (Using Diagrams) to/from Stakeholders 

This final group involved such concepts as diagram, information, ease, communica- 
tion, method, examples, and articulate. Related concepts were means, principals, 
inability, hard, audience, find, and stakeholders. From these key concepts, we de- 
duced a factor as the degree to which diagrams can facilitate effective communication 
between analysts and key stakeholders in the organisation. In other words, to what 
extent can the use of diagrams enhance (hinder) the explanation to, and understanding 
by, the stakeholders of the situation being modelled? 

Using these five new factors, we revisited the unclassified comments and, using the 
same dual coder process as before, we confirmed a classification for those outstanding 
comments easily. Table 6 presents this classification and the relative importance of 
those newly identified factors. 
Table 6. Relative importance of factors unique to conceptual modelling

Key | Percentage | Total
--- | --- | ---
Communication (Diagrams) to/from Stakeholders | 28% | 46
Internal Knowledge (Lack of) of Techniques | 27% | 44
User Expectations Management | 18% | 30
Understanding models integration into the business | 17% | 28
Tool/Software deficiencies | 10% | 17
Total: | 100% | 165



As can be seen from Table 6, communication using diagrams and internal knowl- 
edge (lack of) of the modelling techniques are major issues specific to the continued 
use of modelling in organisations. To a lesser degree, properly managing users’ ex- 
pectations of modelling and ensuring users understand how the outcomes of a specific 
modelling task support the overall enterprise systems architecture are important to the 
continued use of conceptual modelling. Deficiencies in software tools that support 
conceptual modelling frustrate the analyst’s work occasionally. 
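
The percentages in Table 6 follow directly from the comment counts, as the short sketch below shows. The dictionary name `table6` and the abbreviated factor labels are assumptions made for this illustration; the counts are those reported in the table.

```python
# Comment counts per newly identified factor, as reported in Table 6.
table6 = {
    "Communication (diagrams)": 46,
    "Internal knowledge (lack of)": 44,
    "User expectations management": 30,
    "Understanding integration": 28,
    "Tool/software deficiencies": 17,
}

total = sum(table6.values())  # number of previously unclassified comments

# Share of each factor, rounded to whole percent as in the table.
shares = {factor: round(100 * count / total) for factor, count in table6.items()}

# The same comments as a share of all 980 comments (the Unclassified row of Table 5).
unclassified_share = round(100 * total / 980)
```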



6 Conclusions and Future Work 

This paper has reported the results of a survey conducted nationally in Australia on 
the status of conceptual modelling. It achieved 370 responses and a relevant response 
rate of 23.6 percent. The study found that the top six most frequently used modelling 
techniques were ER diagramming, data flow diagramming, systems flowcharting, 
workflow modelling, RAD, and UML. Furthermore, it found that clearly Visio is the 
preferred tool of choice for business systems modelling currently. Rational Rose and 
Oracle Developer suite were a long way second in frequent use. Database design and 
management remains the highest average purpose for use of modelling techniques. 
This fact links to the result of ER diagramming being the most frequently used model- 
ling technique. Moreover, software development as a purpose would support the high 
usage of data flow diagramming and ER diagramming. A major contribution of this 
study is the analysis of textual data concerning critical success factors and problems/issues
in the continued use of conceptual modelling. Clearly, relative advantage
(disadvantage)/usefulness from the perspective of the analyst was the major driving
factor influencing the decision to continue (discontinue) modelling. Moreover, using a
state-of-the-art textual analysis and machine-learning software package called Lexi- 
mancer, this study identified five factors that uniquely influence the continued use 
decision of analysts, viz., communication (using diagrams) to/from stakeholders, in- 
ternal knowledge (lack of) of techniques, user expectations management, understand- 
ing models integration into the business, and tool/software deficiencies. 

The results of this work are limited in several ways. Although every effort was 
taken to mitigate potential limitations, it still suffers from the usual problems with 
surveys, most notably, potential bias in the responses and lack of generalisability of 
the results to other people and settings. More specifically, in relation to the qualitative 
analysis, even though a form of dual coding (with confirmation) was employed, there 
still remains subjectivity in the classification of comments. Furthermore, while the 
members of the research team all participated, the identification of the factors using
the Leximancer document map and the principles of relatedness and centrality remains
arguable.

We intend to extend this work in two ways. First, we will analyse the data further,
investigating cross-tabulations and correlations between the quantitative data and the
qualitative results reported in this paper. For example, do the factors influencing the
continued-use decision vary by the demographic dimensions of source of formal training,
years of modelling experience, and the like? Second, we want to administer the
survey in other countries (Sweden and the Netherlands already) to address the issues of
lack of generalisability in the current results and cultural differences in conceptual
modelling.
Abstract. Since its introduction, the Entity-Relationship (ER) model has been
the vehicle of choice in communicating the structure of a database schema in an 
implementation-independent fashion. Part of its popularity has no doubt been 
due to the clarity and simplicity of the associated pictorial Entity-Relationship 
Diagrams (“ERD's”) and to the dependable mapping it affords to a relational 
database schema. Although the model has been extended in different ways over 
the years, its basic properties have been remarkably stable. Even though the ER 
model has been seen as pretty well “settled,” some recent papers, notably [4] 
and [2 (from whose paper our title is derived)], have enumerated what their au- 
thors consider serious shortcomings of the ER model. They illustrate these by 
some interesting examples. We believe, however, that those examples are them- 
selves questionable. In fact, while not claiming that the ER model is perfect, we 
do believe that the overhauls hinted at are probably not necessary and possibly 
counterproductive. 



1 Introduction 

Since its inception [5], the Entity-Relationship (ER) model has been the primary ap- 
proach for presenting and communicating a database schema at the “conceptual” level 
(i.e., independent of its subsequent implementation), especially by means of the asso- 
ciated Entity-Relationship Diagram (ERD). There’s also a fairly standard method for 
converting it to a relational database schema. In fact, if the ER model is in some sense 
“correct,” then the associated relational database schema should be in pretty good 
normal form [15]. Of course, there have been some suggested extensions to Chen’s 
original ideas (e.g., specialization and aggregation as in [10, 19]), some different 
approaches for capturing information in the ERD, and some variations on the map- 
ping to the relational model, but the degree of variability has been relatively minor. 
One reason for the remarkable robustness and popularity of the approach is no doubt
the wide appreciation for the simplicity of the diagram. Consequently, the desirability
of incorporating additional features in the ERD must be weighed against the danger of
overloading it with so much information that it loses its visual power in communicating
the structure of a database. In fact, the model’s versatility is also evident in its
relatively straightforward mappability to the newer Object Data Model [7]. Now admittedly,
an industrial-strength ERD reflecting an actual enterprise would necessarily
be some orders of magnitude more complex than even the production numbers in standard
texts [e.g., 10]. However, this does not weaken the ability of a simple ERD to
capture local pieces of the enterprise, nor does it lessen the importance of ER-type
thinking in communicating a conceptual model.

Quite recently, however, both Camps and Badia have demonstrated [4, and 2 (from 
whose paper the title of this one is derived)] some apparent shortcomings in the ER 
model, both in the model itself and in the processes of conversion to the relational 
model and its subsequent normalization. They have illustrated these problems through 
some interesting examples. They also make some recommendations for improve- 
ments, based on these examples. However, while not claiming that the ER model can 
be all things to all users, we believe that the problems presented in the examples de- 
scribed in those two papers are due less to the model and more to its incorrect applica- 
tion. 

Extending the ERD to represent complex multi-relation constraints or constraints at
the attribute level is an interesting research topic, but is not always desirable. We
claim that representing them would clutter the ERD as a conceptual model at the
enterprise level; complex constraints would be better specified in a textual or
language-oriented format than at the ERD level.

The purpose of this paper is to take these examples as a starting point to discuss the
possible shortcomings of the ER model and the necessity, or lack thereof, for modifying
it in order to address them. We therefore begin by reviewing and analyzing those
illustrations. Section 2 describes and critiques Camps’ scenarios; Section 3 does the
same for Badia’s. Section 4 considers some related issues, most notably a general design
principle only minimally offered in the ER model. Section 5 concludes our paper.



2 The Camps Paper 

In [4], the author begins by describing an apparently simple enterprise. It has a 
straightforward ERD that leads to an equally straightforward relational database 
schema. But Camps then escalates the situation in stages, to the point where the ER 
model is not currently able to accommodate the design, and where normalizing the 
associated relational database schema is also unsatisfying. Since we are primarily 
concerned with problems attributed to the ER model, we will concentrate here on that 
aspect of the paper. However, the normalization process at this point is closely tied to 
that model, so we will include some discussion of it as well. We now give a brief 
recapitulation, with commentary. 

At first, Camps considers an enterprise with four ingredients: Dealer, Product, 
State, and Concession, where Concession is a ternary relationship among the other 
three, implemented as entity types. Each ingredient has attributes with fairly obvious 
semantics, paraphrased here: d-Id, d-Address; p-Id, p-Type; s-Id, s-Capital; and c- 
Date. The last attribute’s semantics represents the date on which a given state awards 
a concession to a given dealer for a given product. As for functional dependencies, 
besides the usual ones, we are told that for a given state/product combination, there 
can only be one dealer. Thus, a minimal set of dependencies is as follows: 

{s-Id, p-Id} -> d-Id
{s-Id, p-Id} -> c-Date
d-Id -> d-Address                    (A)
p-Id -> p-Type
s-Id -> s-Capital
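
The role of {s-Id, p-Id} as a key can be checked mechanically with the standard attribute-closure computation. The sketch below is a minimal illustration under the dependencies in (A); the names `closure`, `FDS_A` and `ALL_ATTRS` are choices made for this example and do not come from [4].

```python
def closure(attrs, fds):
    """Return the closure of `attrs` under the functional dependencies `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already derivable, the right side is too.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


# The dependency set (A), written as (left side, right side) pairs.
FDS_A = [
    (frozenset({"s-Id", "p-Id"}), frozenset({"d-Id"})),
    (frozenset({"s-Id", "p-Id"}), frozenset({"c-Date"})),
    (frozenset({"d-Id"}), frozenset({"d-Address"})),
    (frozenset({"p-Id"}), frozenset({"p-Type"})),
    (frozenset({"s-Id"}), frozenset({"s-Capital"})),
]
ALL_ATTRS = {"s-Id", "s-Capital", "p-Id", "p-Type", "d-Id", "d-Address", "c-Date"}

key_closure = closure({"s-Id", "p-Id"}, FDS_A)   # reaches every attribute
state_closure = closure({"s-Id"}, FDS_A)         # reaches only state attributes
```

Because the closure of {s-Id, p-Id} covers all seven attributes while neither attribute alone does, the pair is a minimal key for the ternary Concession.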
An ERD for this is given in Figure 1 (attributes are eliminated in the figures, for
the sake of clarity), and the obvious relational database schema is as follows (key
attributes appear before the semicolon):

State(s-Id; s-Capital)
Product(p-Id; p-Type)
Dealer(d-Id; d-Address)                    (B)
Concession(s-Id, p-Id; d-Id, c-Date)




Fig. 1. Example of 1:N:N relationship (from Figure 1 in [4], modified) 



The foreign key constraints derive here from the two components of Concession’s 
key, which are primary keys of their native schemas. Since the only functional de- 
pendencies are those induced by keys, the schema is in BCNF. Here Camps imposes 
further constraints: 

p-Id -> d-Id
s-Id -> d-Id

In other words, if a product is offered as a concession, then it can only be with a single
dealer, regardless of the state; and analogously on the state-dealer side. The author
is understandably unhappy about the absence of a standard ERD approach to
accommodate the resulting binary constraining relationships (using the language of
[12]), which he renders in a rather UML-like fashion [17], similar to Figure 2. At this
point, in order to highlight the generic structure, he introduces new notation (A, B, C,
D for State, Dealer, Product, Concession, respectively). However, we will keep the
current names for the sake of comfort, while still pursuing the structure of his narrative.
He notes that the resulting relational database schema includes the non-3NF relation
schema Concession(s-Id, p-Id, d-Id, c-Date), whose key is {s-Id, p-Id}. Further, when
Camps wishes to impose the
constraints that a state (respectively product) instance can determine a dealer if and
only if there has been a concession arranged with some product (respectively state),
he expresses them with these conditions:

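\[
\pi_{\text{s-Id},\,\text{d-Id}}(\text{Concessions}) = \pi_{\text{s-Id},\,\text{d-Id}}(\text{State})
\]
\[
\pi_{\text{p-Id},\,\text{d-Id}}(\text{Concessions}) = \pi_{\text{p-Id},\,\text{d-Id}}(\text{Product})
\]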

Each of these can be viewed as a double inclusion dependency and must be ex- 
pressed using the CHECK construct in SQL. 
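
To make the notion of a double inclusion dependency concrete, the sketch below checks that two projections coincide on toy data. Everything in it is hypothetical: the rows, the helper names `project` and `double_inclusion_holds`, and the assumption that State carries a d-Id attribute, as the projection in the condition requires.

```python
def project(rows, attrs):
    """Relational projection: the set of distinct value tuples for `attrs`."""
    return {tuple(row[a] for a in attrs) for row in rows}


def double_inclusion_holds(left, right, attrs):
    """True when the projections of both relations on `attrs` coincide."""
    return project(left, attrs) == project(right, attrs)


# Hypothetical instances, for illustration only.
state = [
    {"s-Id": "S1", "s-Capital": "C1", "d-Id": "D1"},
    {"s-Id": "S2", "s-Capital": "C2", "d-Id": "D2"},
]
concessions = [
    {"s-Id": "S1", "p-Id": "P1", "d-Id": "D1", "c-Date": "2004-01-15"},
    {"s-Id": "S2", "p-Id": "P2", "d-Id": "D2", "c-Date": "2004-02-01"},
]
holds = double_inclusion_holds(concessions, state, ["s-Id", "d-Id"])

# A state that names a dealer without any concession breaks the condition.
state_extra = state + [{"s-Id": "S3", "s-Capital": "C3", "d-Id": "D1"}]
broken = double_inclusion_holds(concessions, state_extra, ["s-Id", "d-Id"])
```

Equality of the two projections is the same as inclusion in both directions, which is why the condition is called a double inclusion dependency.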
Fig. 2. Two imposed FDs (from Figure 2 of [4])



Now we note that it is actually possible to capture the structural properties of the 
enterprise at this stage by the simple (i.e., ternary-free) ERD of either Figure 3a [13] 
or Figure 3b [18]. The minimal set of associated functional dependencies in Figure 3a 
is as follows: 

s-Id -> s-Capital
p-Id -> p-Type
d-Id -> d-Address
s-Id -> d-Id                         (D)
p-Id -> d-Id
{s-Id, p-Id} -> c-Date

One, therefore, obtains the following relational database schema (key attributes appear
before the semicolon), which is, of course, in BCNF, since all functional dependencies
are due to keys:

State(s-Id; s-Capital, d-Id)
Product(p-Id; p-Type, d-Id)
Dealer(d-Id; d-Address)                    (E)
Concession(s-Id, p-Id; c-Date)
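
The BCNF claim can be spot-checked with a small sketch. It is a simplification: it tests only the dependencies listed explicitly, not every dependency they imply, and the names `closure`, `bcnf_violations` and `FDS_D` are assumptions of this example. Under that caveat, the binary schema shows no violations, whereas the ternary Concession(s-Id, p-Id, d-Id, c-Date) violates BCNF once s-Id -> d-Id and p-Id -> d-Id are imposed.

```python
def closure(attrs, fds):
    """Return the closure of `attrs` under the functional dependencies `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(relation_attrs, fds):
    """Return the listed FDs that apply to the relation but whose left side
    is not a superkey of it. Only explicitly listed FDs are examined."""
    violations = []
    for lhs, rhs in fds:
        applicable = (lhs | rhs) <= relation_attrs
        trivial = rhs <= lhs
        if applicable and not trivial and not relation_attrs <= closure(lhs, fds):
            violations.append((lhs, rhs))
    return violations


def fd(lhs, rhs):
    """Build one functional dependency from two attribute lists."""
    return (frozenset(lhs), frozenset(rhs))


# Dependencies for the binary model of Figure 3a.
FDS_D = [
    fd(["s-Id"], ["s-Capital"]), fd(["p-Id"], ["p-Type"]),
    fd(["d-Id"], ["d-Address"]), fd(["s-Id"], ["d-Id"]),
    fd(["p-Id"], ["d-Id"]), fd(["s-Id", "p-Id"], ["c-Date"]),
]

binary_schema = {
    "State": {"s-Id", "s-Capital", "d-Id"},
    "Product": {"p-Id", "p-Type", "d-Id"},
    "Dealer": {"d-Id", "d-Address"},
    "Concession": {"s-Id", "p-Id", "c-Date"},
}
binary_ok = all(not bcnf_violations(attrs, FDS_D) for attrs in binary_schema.values())

ternary_concession = {"s-Id", "p-Id", "d-Id", "c-Date"}
ternary_violations = bcnf_violations(ternary_concession, FDS_D)
```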




Fig. 3a. A binary model of Figure 2 with Concession as a M:N relationship 
Fig. 3b. A binary model of Figure 2 with Concession as an intersection (associate) entity



Admittedly, this approach loses something: the ternary character of Concession. 
However, any dealer information relevant to a concession instance can be discovered 
by a simple join; a view can also be conveniently defined. The ternary relationship in 
Figure 2 is therefore something of a red herring when constraining binary relationships 
are imposed on a ternary relationship. In other words, it is possible that an expansion 
of the standard ERD language to include n-ary relationships’ being constrained by 
m-ary ones might be a very desirable feature, but its absence is not a surprising one. 
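
The join-and-view remark can be made concrete with a small sketch. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module, underscores replace the hyphens in the attribute names, and the view name ConcessionWithDealer and all sample values are hypothetical.

```python
import sqlite3

def build_concession_db():
    """Create the four BCNF relations plus a view that recovers the dealer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Dealer (d_id TEXT PRIMARY KEY, d_address TEXT);
        CREATE TABLE State (s_id TEXT PRIMARY KEY, s_capital TEXT,
                            d_id TEXT REFERENCES Dealer(d_id));
        CREATE TABLE Product (p_id TEXT PRIMARY KEY, p_type TEXT,
                              d_id TEXT REFERENCES Dealer(d_id));
        CREATE TABLE Concession (s_id TEXT REFERENCES State(s_id),
                                 p_id TEXT REFERENCES Product(p_id),
                                 c_date TEXT,
                                 PRIMARY KEY (s_id, p_id));
        -- The dealer of a concession is the one determined by its state.
        CREATE VIEW ConcessionWithDealer AS
            SELECT c.s_id, c.p_id, s.d_id, c.c_date
            FROM Concession AS c JOIN State AS s ON s.s_id = c.s_id;
        -- Hypothetical sample rows, for illustration only.
        INSERT INTO Dealer VALUES ('d1', 'address-1');
        INSERT INTO State VALUES ('s1', 'capital-1', 'd1');
        INSERT INTO Product VALUES ('p1', 'type-1', 'd1');
        INSERT INTO Concession VALUES ('s1', 'p1', '2004-01-01');
    """)
    return conn

def concessions_with_dealer(conn):
    """Return the ternary-looking rows (s_id, p_id, d_id, c_date) from the view."""
    query = "SELECT s_id, p_id, d_id, c_date FROM ConcessionWithDealer"
    return conn.execute(query).fetchall()
```

The view presents the same four columns as the original ternary relation schema, while the stored relations remain binary.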

Jones and Song showed that the ternary schema with the FDs imposed in Figure 2 
can have a lossless decomposition, but cannot have an FD-preserving schema (Pattern 
11 in [13]). Camps now arrives at the same schema (E) (by normalizing his non-3NF 
one, not by way of our ERD in Figure 3a). The problem he sees is incorporating the 
semantics of (C). The constraints he develops are: 

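$\pi_{\text{s-Id},\,\text{p-Id}}(\text{Concessions}) \subseteq \pi_{\text{s-Id},\,\text{p-Id}}(\text{State} \times \text{Product})$ 

$\pi_{\text{s-Id}}(\text{State}) \subseteq \pi_{\text{s-Id}}(\text{Concessions})$ iff State.d-Id is not null        (F) 

$\pi_{\text{p-Id}}(\text{Product}) \subseteq \pi_{\text{p-Id}}(\text{Concessions})$ iff Product.d-Id is not null 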

The last two conditions seem not to make sense syntactically. The intention is most 
likely the following (keeping the first condition and rephrasing the other two): 

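$\pi_{\text{s-Id},\,\text{p-Id}}(\text{Concessions}) \subseteq \pi_{\text{s-Id},\,\text{p-Id}}(\text{State} \times \text{Product})$ 

$(\forall s_0 \in \pi_{\text{s-Id}}(\text{State}))\,\bigl(s_0 \in \pi_{\text{s-Id}}(\text{Concessions}) \text{ iff } (\exists d_0)(\langle s_0, d_0 \rangle \in \pi_{\text{s-Id},\,\text{d-Id}}(\text{State}))\bigr)$        (G) 

$(\forall p_0 \in \pi_{\text{p-Id}}(\text{Product}))\,\bigl(p_0 \in \pi_{\text{p-Id}}(\text{Concessions}) \text{ iff } (\exists d_0)(\langle p_0, d_0 \rangle \in \pi_{\text{p-Id},\,\text{d-Id}}(\text{Product}))\bigr)$ 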

At any rate, Camps shows how SQL can accommodate these conditions too, using 
CHECKs in the form of ASSERTIONs, but he considers any such effort (that is, the need 
for any conditions besides key dependencies and inclusion constraints) to be anomalous. 
We feel that this is not so surprising a situation after all. The complexity of real-world 
database design is so great that, on the contrary, it is quite common to encounter a 
situation where many integrity constraints are not expressible in terms of functional 
and inclusion dependencies alone. Instead, one must often use the type of constructions 
that Camps shows us, or use triggers, to implement complex real-world integrity 
constraints. 
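
As a rough illustration of what such explicitly coded conditions check, the following sketch tests the rephrased conditions (G) over in-memory data. It is not Camps' SQL formulation; the function name and the dictionary/set encoding of the relations are assumptions made only for this example.

```python
def satisfies_conditions_g(states, products, concessions):
    """Check the three conditions labeled (G).

    states:      dict mapping s_id -> d_id (None stands for a null d-Id)
    products:    dict mapping p_id -> d_id (None stands for a null d-Id)
    concessions: set of (s_id, p_id) pairs
    """
    concession_states = {s for s, _ in concessions}
    concession_products = {p for _, p in concessions}

    # First condition: every concession refers to a known state and product.
    inclusion_ok = all(s in states and p in products for s, p in concessions)

    # Second condition: a state appears in a concession iff it has a dealer.
    states_ok = all((s in concession_states) == (d is not None)
                    for s, d in states.items())

    # Third condition: the same requirement for products.
    products_ok = all((p in concession_products) == (d is not None)
                      for p, d in products.items())

    return inclusion_ok and states_ok and products_ok
```

A database trigger or assertion would have to evaluate essentially the same predicate each time State, Product, or Concession is modified.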

3 The Badia Paper 

In his paper [2], Badia in turn revisits the ER model because of its usefulness and 
importance. He contends that, as database applications get more complex and 
sophisticated and the need for capturing more semantics grows, the 
ER model should be extended with more powerful constructs to express powerful 
semantics and variable constraints. He presents six scenarios that apparently illustrate 
some inadequacies of the ER model; he classifies the first five as relationship 
constraints that the model is not up to incorporating and the sixth as an attribute 
constraint. We feel that some of the examples he marshals, described below in 3.3 and 
3.6, are questionable, leading us to ask whether they warrant extending the model. 
Badia does discuss the downside of overloading the model, however, including a 
thoughtful mention of tradeoffs between minimality and power. In this section we 
give a brief recapitulation of the examples, together with our analyses. 

3.1 Camps Redux 

In this portion of his paper, Badia presents Camps’ illustrations and conclusions, 
which he accepts. We’ve already discussed this. 

3.2 Commutativity in ERD’s 

In mathematical contexts, we call a diagram commutative [14] if all different routes 
from a common source to a common destination are equivalent. In Figure 4, from 
Badia’s paper (there called Figure 1), there are two different ways to navigate from 
Course to Department: directly, or via the Teacher entity. To say that this particular 
diagram commutes, then, is to say that for each course, its instructor must be a faculty 
member of the department that offers it. Again, there is a SQL construct for indicating 
this. Although Badia doesn’t use the term, his point here is that there is no mechanism 
for ERD’s to indicate a commutativity constraint. This is correct, of course. Consider, 
however, what it would take to represent multi-relation constraints of this kind in a 
diagram with more than 50 entities and relationships, a size that is quite common in 
real-world applications. We believe, therefore, that this kind of multi-relation constraint 
is better specified in a textual or language-oriented syntax, such as OCL [17], than at the 
diagram level. In this way, a diagram can clearly deliver its major semantics without 
incurring visual overload and clutter. 
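
To show what a textual specification of the commutativity constraint amounts to, here is a minimal sketch that checks it over in-memory data. The function name and the dictionary encoding of the Offers, Teaches, and faculty-membership relationships are assumptions for illustration; courses with no teacher are simply skipped.

```python
def commutativity_violations(offers, teaches, member_of):
    """Return the courses for which the two paths to Department disagree.

    offers:    dict mapping course -> department that offers it
    teaches:   dict mapping course -> teacher (unstaffed courses are absent)
    member_of: dict mapping teacher -> department of the teacher
    """
    return sorted(
        course
        for course, teacher in teaches.items()
        if member_of.get(teacher) != offers.get(course)
    )
```

The diagram commutes on a given database instance exactly when the returned list is empty.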




Fig. 4. An example of multi-paths between two entities (from Figure 1 in [2]) 

In certain limited situations [8] the Offers relationship might be superfluous and 
recovered by composing the other two relationships (or, in the relational database 
schema, by performing the appropriate joins). We would need to be careful about 
dropping Offers, however. For example, if a particular course were at present 
unstaffed, then the Teaches link would be broken. This is the case when the Course 
entity has partial (optional) participation with respect to the Department entity. Without 
an explicit Offers instance, we wouldn’t know which department offers the course. This 
is an example of a chasm trap, which requires an explicit Offers relationship [6]. Another 
case where we couldn’t rely on merely dropping one of the relationship links would arise 
if a commutative diagram involved the composition of two relationships in each path; 
then we would surely need to retain them both and to implement the constraint 
explicitly. 

We note that allowing cycles and redundancies in ERD’s has been a topic of re- 
search in the past. Atzeni and Parker [1] advise against it; Markowitz and Shoshani 
[15] feel that it is not harmful if it is done right. Dullea and Song [8, 9] provide a 
complete analysis of redundant relationships in cyclic ERD’s. Their decision rules on 
redundant relationships are based on both maximum and minimum cardinality con- 
straints. 



3.3 Acyclicity of a Recursive Closure 

Next, Badia considers the recursive relationship ManagerOf (on an Employee en- 
tity). He would like to accommodate the hierarchical property that nobody can be an 
indirect manager of oneself. Again, we agree with this observation but can’t comment 
on how desirable such an ER feature would be at a diagram level. Badia points out 
that this is a problem even at the level of the relational database, although some Ora- 
cle releases can now accommodate the constraint. 
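
The acyclicity property can be stated operationally. The sketch below is one possible check, assuming that each employee has at most one direct manager and that the ManagerOf relationship is encoded as a dictionary; both the encoding and the function name are assumptions for illustration.

```python
def has_management_cycle(manager_of):
    """Return True if some employee is a direct or indirect manager of itself.

    manager_of: dict mapping employee -> direct manager (None for no manager)
    """
    for start in manager_of:
        seen = set()
        current = manager_of.get(start)
        # Walk up the chain of managers until it ends or repeats.
        while current is not None:
            if current == start or current in seen:
                return True
            seen.add(current)
            current = manager_of.get(current)
    return False
```

Enforcing the property in a database means running a check of this kind, over the transitive closure of ManagerOf, whenever the relationship changes.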



3.4 Fan Traps 

At this point the author brings Figure 5 (adapted from [6], where it appears as Figure 
11.19(a); for Badia it is Figure 2) to our attention. (The original figure uses the 
“Merise,” or “look here” approach [17]; we’ve modified it to make it consistent with the 
other figures in this paper.) The problem, called a fan trap, arises when one attempts 
to enforce a constraint that a staff person must work in a branch operated by her/his 
division. This ER anomaly percolates to the relational schemas as well. Further, if one 
attempts to patch things up by including a third binary link, between Staff and 
Branch, then one is faced with the commutativity dilemma of Section 3.2. In general, 
fan traps arise when there are two 1:N relationships from a common entity type to two 
different destinations. The two typical solutions for fan traps are either to add a third 
relationship between the two many-side entities or to rearrange the entities to make the 
connection unambiguous. The problem in Figure 5 here is simply caused by an 
incorrect ERD and can be resolved by rearranging entities as shown in Figure 6. Figure 6 
avoids the difficulties at both the ER and relational levels. In fact, this fix is even 
exhibited in the Connolly source itself. We note that the chasm trap discussed in 
Section 3.2 and the fan trap are commonly called connection traps [6], which make the 
connection between two entities separated by a third entity ambiguous. 

Fig. 5. A semantically wrong ERD with a fan trap (from Figure 2 in [2] and Figure 11.19(a) 
from [6]) 



[Diagram for Fig. 6 not reproduced; surviving labels: Staff, Operates, Works_at, N] 

Fig. 6. A correct ERD of Figure 5, after rearranging entities 



3.5 Temporal Considerations 

Here Badia looks at a Works-in relationship, M:N between Employee and Project, 
with attributes start-date and end-date. A diagram for this might look something like 
Figure 7b; for the purposes of clarity, most attributes have been omitted. Badia states 
that the following rule cannot be represented in an ERD: even though an employee may 
work in many projects, an employee may not work in two projects at the same time. It 
appears impossible to express the rule, although the relationship is indeed M:N. But 
wouldn’t this problem be solved by creating a third entity type, TimePeriod, with the 
two date attributes as its composite key, and letting Works-in be ternary? The new 
relationship would be M:N:1, as indicated in Figure 7c, with the 1 on the Project 
node, of course. In Figures 7a through 7d, we show several variations of this case 
related to capturing the history of works-in relationships and the above constraint. 
We’ll comment additionally on this in Section 4. 
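
The temporal rule itself is easy to state as a check over Works-in instances. The following sketch assumes that each instance carries an inclusive start-date and end-date; the tuple layout and the function name are assumptions for illustration.

```python
from datetime import date
from itertools import combinations

def overlapping_assignments(works_in):
    """Return (employee, project, project) triples whose periods overlap.

    works_in: list of (employee, project, start_date, end_date) tuples,
              with both dates inclusive.
    """
    conflicts = []
    for first, second in combinations(works_in, 2):
        emp1, proj1, start1, end1 = first
        emp2, proj2, start2, end2 = second
        same_employee = emp1 == emp2
        different_projects = proj1 != proj2
        # Two closed intervals overlap when each starts before the other ends.
        periods_overlap = start1 <= end2 and start2 <= end1
        if same_employee and different_projects and periods_overlap:
            conflicts.append((emp1, proj1, proj2))
    return conflicts
```

An instance of the M:N relationship of Figure 7b satisfies the rule exactly when this list is empty.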



[Diagram for Fig. 7a not reproduced; surviving labels: Employee (M), Project (1)] 




Fig. 7a. An employee may work in only one project and each project can have many employ- 
ees. The diagram already assumes that an employee must work for only one project at a time. 
This diagram is not intended to capture any history of works-in relationship 



[Diagram for Fig. 7b not reproduced; surviving labels: Employee (M), Project (N)] 




Fig. 7b. An employee may work in many projects and each project may have many employees. 
The diagram assumes that an employee may work for many projects at the same time. This 
diagram is also not intended to capture any history of works-in relationship 

Fig. 7c. An employee may work in only one project at a time. This diagram can capture a 
history of works-in relationship of an employee for projects and still satisfies the 
constraint that an employee may work in only one project at a time 




Fig. 7d. In Figure 7c, if entity TimePeriod is not easily materialized, we can reify the 
relationship Works-in to an intersection entity. This diagram can capture the history of 
works-in relationship, but does not satisfy the constraint that an employee may work in 
only one project at a time 



3.6 Range Constraints 

While the five previous cases exemplify what Badia calls relationship constraints, 
this one is an attribute constraint. The example given uses the following two tables: 

Employee (employee_id, rank_id, salary, ...) 

Rank (rank_id, max_salary, min_salary) 

The stated problem is that the ERD that represents the above schema cannot ex- 
press the fact that the salary of an employee must be within the range determined by 
his or her rank. Indeed, in order to enforce this constraint, explicit SQL code must be 
generated. Badia correctly states that the absence of information at the attribute level 
is a limitation and causes difficulty in resolving semantic heterogeneity. We believe, 
however, that information and constraints at the attribute level could be expressed at the 
data dictionary level or in a separate low-level diagram below the ERD level. Again, 
this will keep an ERD as a conceptual model at the enterprise level without too much 
clutter. Consider the complexity of representing attribute constraints in ERDs for 
real-world applications that have over 50 entities and several hundred attributes. The 
use of a CASE tool that supports a conceptual ERD together with an associated low-level 
diagram for attributes and/or a data dictionary should be a step in the right direction 
for this problem. 
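
To illustrate the kind of explicit SQL code the range constraint calls for, here is a hedged sketch using a trigger in SQLite through Python's sqlite3 module. The trigger name, the sample rank row, and the helper functions are assumptions for illustration; only inserts are guarded, and a complete solution would also cover updates to both tables.

```python
import sqlite3

def make_salary_db():
    """Create Employee and Rank with a trigger enforcing the salary range."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE "Rank" (rank_id INTEGER PRIMARY KEY,
                             max_salary REAL, min_salary REAL);
        CREATE TABLE Employee (employee_id INTEGER PRIMARY KEY,
                               rank_id INTEGER REFERENCES "Rank"(rank_id),
                               salary REAL);
        -- Reject an insert unless the salary lies within the rank's range.
        CREATE TRIGGER salary_in_range BEFORE INSERT ON Employee
        BEGIN
            SELECT RAISE(ABORT, 'salary outside the range of the rank')
            WHERE NOT EXISTS (
                SELECT 1 FROM "Rank" AS r
                WHERE r.rank_id = NEW.rank_id
                  AND NEW.salary BETWEEN r.min_salary AND r.max_salary);
        END;
        -- Hypothetical sample rank: salaries from 3000 to 5000.
        INSERT INTO "Rank" VALUES (1, 5000, 3000);
    """)
    return conn

def try_insert(conn, employee_id, rank_id, salary):
    """Attempt an insert; return False if the trigger rejects it."""
    try:
        conn.execute("INSERT INTO Employee VALUES (?, ?, ?)",
                     (employee_id, rank_id, salary))
        return True
    except sqlite3.DatabaseError:
        return False
```

None of this logic is visible in the ERD for the two tables, which is the limitation under discussion.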

4 General Cardinality Constraints 

While on the whole, as indicated above, we feel many of the alleged shortcomings of 
the ER model claimed in recent papers are not justified, some of those points have 
been well taken and are quite interesting. However, there is another important feature 
of conceptual design that we shall consider here, one that the ER model really does 
lack. In this section, we briefly discuss McAllister’s general cardinality constraints 
[16] and their implications. 

McAllister’s setting is a general n-ary relationship R. In other words, R involves n 
different roles. This term is used, rather than entity types, since the entity types may 
not all be distinct. For example, a recursive relationship, while binary in the 
mathematical sense, involves only a single entity type. Given two disjoint sets of roles A 
and B, McAllister defines Cmax(A,B) and Cmin(A,B) as follows: for a tuple <a>, 
with one component from each role in A, and a tuple <b>, with one component from 
each role in B, let us denote by <a,b> the tuple generated by the two sets of 
components; we recall that A and B are disjoint. Then Cmax(A,B) (respectively Cmin(A,B)) 
is the maximum (respectively minimum) allowable cardinality over all <a> of the set of 
tuples <b> such that $\langle a,b \rangle \in \pi_{A \cup B}(R)$. For example, consider the 
Concession relationship of Figure 1. Then to say that Cmax({State, Product}, {Dealer}) = 1 
is to express the fact that {s-Id, p-Id} -> d-Id. And the condition 
Cmin({Product}, {State, Dealer}) = 1 is equivalent to the constraint that Product is total 
on Concession. Now, as we see from these examples, Cmax gives us information about 
functional dependencies and Cmin about participation constraints. When B is a singleton 
set and A its complement, this is sometimes called the “Chen” approach to cardinality [11] 
or “look across”; when A is a singleton set and B its complement, it is called the 
“Merise” approach [11] or “look here.” All told, McAllister shows that there are 
$3^n - 2^{n+1} + 1$ different combinations possible for A and B, where n is the number of 
different roles. 
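
The count of combinations and the definitions can be checked with a short sketch. Two assumptions are made here and should be read as such: the enumeration takes A and B to be nonempty, disjoint sets of roles (which reproduces the stated count), and the cardinalities are the ones observed in a given relationship instance, not the allowable ones declared in a schema. Tuples <a> that never occur in the instance are not counted by the minimum.

```python
from itertools import combinations

def count_role_pairs(n):
    """Number of (A, B) combinations for n roles, as stated in the text."""
    return 3 ** n - 2 ** (n + 1) + 1

def enumerate_role_pairs(roles):
    """List all ordered pairs (A, B) of nonempty, disjoint sets of roles."""
    roles = list(roles)
    subsets = [frozenset(c)
               for size in range(1, len(roles) + 1)
               for c in combinations(roles, size)]
    return [(a, b) for a in subsets for b in subsets if not (a & b)]

def observed_cmax_cmin(instance, a_positions, b_positions):
    """Observed (Cmax, Cmin) for role positions A and B in a set of tuples."""
    groups = {}
    for row in instance:
        a_part = tuple(row[i] for i in a_positions)
        b_part = tuple(row[i] for i in b_positions)
        groups.setdefault(a_part, set()).add(b_part)
    sizes = [len(b_parts) for b_parts in groups.values()]
    return max(sizes), min(sizes)
```

For a ternary relationship the enumeration yields 12 pairs, which agrees with the formula for n = 3 and gives a sense of why a tabular presentation is still manageable at that size.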

Clearly, given this explosive growth, it is impractical to include all possible cardi- 
nality constraints in a general ERD, although McAllister shows a tabular approach 
that works pretty well for ternary relationships. He shows further that there are many 
equalities and inequalities that must hold among the cardinalities, so that the entries in 
the table are far from independent. The question arises as to which cardinalities have 
the highest priorities and should thus appear in an ERD. It turns out that the Merise 
and Chen approaches give the same information in the binary case but not in the ter- 
nary one, which becomes the contentious case (n>3 is rare enough not to be a serious 
issue). In fact one finds both Chen [as in 10] and Merise [as in 3] systems in practice. 
In his article, Genova feels that UML [17] made the wrong choice by using the Chen 
method for its Cmin’s, and he suggests that class diagrams include both sets of infor- 
mation (but only when either A or B is singleton). That does not seem likely to hap- 
pen, though. 

Still, consideration of these general cardinality constraints and McAllister’s axioms 
comes in handy in a couple of the settings we have discussed. The general setting 
helps understand connections between, for example, ternary and related binary rela- 
tionships as in Figure 2 and [12]. And it similarly sheds light on preservation (and 
loss) of information in Section 3.5 above, when a binary relationship is replaced by a 
ternary one. Finally, we believe that it also provides the deep structural information 
for describing the properties of decompositions of the associated relation schemas. It 
is therefore indisputable in our opinion that these general cardinality constraints do 
much to describe the fundamental structure of a relationship in the ER model; only 
portions of which, like the tip of an iceberg, are currently visible in a typical ERD. 
And yet we are not claiming that such information should routinely be included in the 
model. 



5 Conclusion 

We have reviewed recent literature ([4] and [2]) that illustrates, through some 
interesting examples, areas of conceptual database design that are not accommodated 
sufficiently at the present time by the Entity-Relationship model. However, some of these 
examples seem not to hold up under scrutiny. 

Capabilities that the model does indeed lack are constraints on commutative dia- 
grams (Section 3.2 above), recursive closures (3.3), and some range conditions (3.6) 
as pointed out by Badia. Another major conceptual modeling tool missing in the ER 
model is that of general cardinality constraints [16]. These constraints are the deep 
structure that underlies such more visible behavior as constraining and related rela- 
tionships, Chen and Merise cardinality constraints, functional dependencies and de- 
compositions, and participation constraints. How many of these missing features 
should actually be incorporated into the ER model is pretty much a question of triage, 
of weighing the benefits of a feature against the danger of circuit overload. 

We believe that some complex constraints, such as multi-relation constraints, are 
better represented in a textual or language-oriented syntax, such as OCL [17], 
than at the ER diagram level. We also believe that information and constraints 
at the attribute level could be expressed at the data dictionary level or in a separate 
low-level diagram below the ERD level. In these ways, we will keep an ERD as a 
conceptual model at the enterprise level to deliver major semantics without visual 
overload and too much clutter. Consider the complexity of an ERD for a real-world 
application that has over 50 entities and hundreds of attributes, and of representing all 
those complex multi-relation and attribute constraints in the ERD. The use of a CASE tool 
that supports a conceptual ERD together with an associated low-level diagram for 
attributes and/or a data dictionary should be a step in the right direction for this 
problem. 

We note that we do not claim that some research topics suggested by Badia, such 
as relationships over relationships and attributes over attributes, are not interesting or 
worthy. Research in those topics would bring interesting new insights and powerful 
ways of representing complex semantics. What we claim here is that the ERD itself 
has much value as it is now, especially for relational applications, to which all of 
Badia’s examples pertain. We believe, however, that extending the ER model to support 
new application semantics, such as biological applications, should be encouraged. 

The “D” in ERD connotes to many researchers and practitioners the simplicity and 
power of communication that account for the model’s popularity. Indeed, as the 
Entity-Relationship model nears its 30th birthday, we find its robustness remarkable. 

Modeling Functional Data Sources as Relations 

Abstract. In this paper we present a model of functional access to data that, 
we argue, is suitable for modeling a class of data repositories characterized by 
functional access, such as web sites. We discuss the problem of modeling such 
data sources as a set of relations, of determining whether a given query expressed 
on these relations can be translated into a combination of functions defined by the 
data sources, and of finding an optimal plan to do so. 

We show that, if the data source is modeled as a single relation, an optimal plan 
can be found in a time linear in the number of functions in the source but, if the 
source is modeled as a number of relations that can be joined, finding the optimal 
plan is NP-hard. 



1 Introduction 

These days, we see a great diversification in the type, structure, and functionality of the 
data repositories with which we have to deal, at least when compared with as little as 
fifteen or twenty years ago. Not too long ago, one could quite safely assume that almost 
all the data that a program had both the need and the possibility to access were stored 
in a relational database or, were this not the case, that the amount of data, their stability, 
and their format made their insertion into a relational database feasible. 

As of today, such a statement would be quite indefensible. A large share of the 
responsibility for this state of affairs must be ascribed, of course, to the rapid diffusion 
of data communication networks, which created a very large collection of data that a 
person or a program might want to use. Most of the data available on data communi- 
cation networks, however, are not in relational form [1] and, due to the volume and the 
instability of the medium, the idea of storing them all into a stable repository is quite 
unfeasible. 

The most widely known data access environment of today, the world-wide web, 
was created with the idea of displaying reasonably well formatted pages of material 
to people, and of letting them “jump” from one page to another. It followed, in other 
words, a rather procedural model, in which elements of the page definition language 
(tags) often stood for actions: a link specified a “jump” from one page to another. While 
a link establishes a connection between two pages, this connection is not symmetric (a 
link that carries you from page A to page B will not carry you from page B to page 
A) and therefore is not a relation between two pages (in the sense in which the term 
“relation” is used in databases), but rather a functional connection that, given page A, 
will produce page B. 

* The work presented in this paper was done under the auspices and with the funding of NIH 
project NCRR RR08 605, Biomedical Informatics Research Network, which the authors 
gratefully acknowledge. 

In addition to this basic mechanism, today many web sites that contain a lot of 
data allow one to specify search criteria using the so-called forms. A form is an input 
device through which a fixed set of values can be assigned to an equally fixed set of 
parameters, the values forming a search criterion against which the data in the web site 
will be matched, returning the data that satisfy the criterion. 

Consider the web site of a public library (an example to which we will return in 
the following). Here one can have a form that, given the name of an author, returns a 
web page (or other data structure) containing the titles of the books written by that 
author. This doesn’t imply that a corresponding form will exist that, given the title of 
a book, will return its author. In other words, the dependence author → book is not 
necessarily invertible. This limitation tells us that we are not in the presence of a set 
of relations but, rather, in the presence of a data repository with functional access. The 
diffusion of the internet as a source of data has, of course, generated a great interest in 
the conceptual modeling of web sites [2-4]. In this paper we present a formalization 
of the problem of representing a functional data source as a set of relations, and of 
translating (whenever possible) relational queries into sequences of functions. 

2 The Model 

For the purpose of this work, a functional data source is a set of procedures that, given 
a number of attributes whose value has been fixed, instructs us on how to obtain a data 
structure containing further attributes related to the former. 

To fix the ideas, consider again the web site of a library. A procedure is defined that, 
given the name of an author, retrieves a data structure containing the titles of all the 
books written by that author. The procedure for doing so looks something like this: 

Procedure 1: author -> set(title) 

i) go to the “search by author” page; 
ii) put the desired name into the “author” slot of the form that you find there; 
iii) press the button labeled “go”; 
iv) look at the page that will be displayed next, and retrieve the list of titles. 

Getting the publisher and the year of publication of a book, given its author and title is 
a bit more complicated: 

Procedure 2: author, title -> publisher, year 

i) execute procedure 1 and get a list of titles; 
ii) search the desired title in the list; 
iii) if found then 
    iii.1) access the book page, by clicking on the title; 
    iii.2) search the publisher and year, and return them; 
iv) else fail. 

On the other hand, in most library web pages there is no procedure that allows one to 
obtain a list of all the books published by a given publisher in a given year, and a query 
asking for such information would be impossible to answer. 
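
The two procedures can be mimicked by a small sketch in which an in-memory dictionary stands in for the web site. The catalog contents, the dictionary layout, and the function names are all hypothetical; the point is only that access is keyed by author, so there is no way to start from a publisher and a year other than already knowing every author.

```python
# Hypothetical stand-in for the library web site, reachable only by author.
CATALOG = {
    "author-a": {"title-1": ("publisher-x", 1997),
                 "title-2": ("publisher-y", 2001)},
    "author-b": {"title-3": ("publisher-x", 1997)},
}

def titles_by_author(author):
    """Procedure 1: author -> set(title)."""
    return set(CATALOG.get(author, {}))

def publisher_and_year(author, title):
    """Procedure 2: author, title -> publisher, year (fails if not found)."""
    if title not in titles_by_author(author):
        raise LookupError("title not found for this author")
    return CATALOG[author][title]
```

No function of the form publisher, year -> set(title) is offered, which mirrors the unanswerable query described above.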

We start by giving an auxiliary definition, and then we give the definition of the kind 
of functional data sources that we will consider in the rest of the paper. 

Definition 1. A data sort S is a pair (N, T), written S = N:T, where N is the name 
of the sort, and T its type. Two data sorts are equal if their names and their types 
coincide. 

A data sort, in the sense in which we use the term here, is not quite a “physical” data 
type. For instance, author:string and title:string are both of the same data type (string) 
but they represent different data sorts.¹ The set of complex sorts is the transitive closure 
of the set of data sorts with respect to Cartesian product of sorts and the formation of 
collection types (sets, bags, and lists). 

Definition 2. A functional data source is a pair (S, F) where $S = \{S_1, \ldots, S_n\}$ is a 
set of data sorts, $F = \{f_1, \ldots, f_m\}$ is a set of functions 
$\alpha \xrightarrow{\;f\;} \beta$, where both $\alpha$ and $\beta$ are composite sorts made 
of sorts in S. 

In the library web site, author:string and year:int are examples of data sorts. The 
procedures are instantiations of functions. Procedure 1, for example, instantiates a 
function $\text{author:string} \xrightarrow{\;f\;} \text{title:string}$. 

The elements “author:string” and “title:string” are examples of composite sorts. 
Sometimes, when there is no possibility of confusion, we will omit the type of the sort. 
Our goal in this paper is to model a functional data source like this one in a way that 
resembles a set of relations upon which we can express our query conditions. To this 
end, we give the following definition. 

Definition 3. A relational model of a functional data source is a set of relations 
$R = \{R_1, \ldots, R_p\}$ where $R_i \subseteq S_{i_1} \times \cdots \times S_{i_q}$ and all 
the $S_i$’s are sorts of the functional data source. The relation $R_i$ is called a 
relational façade for the underlying data source, and will sometimes be indicated as 
$R_i(N_{i_1} : T_{i_1}, \ldots, N_{i_q} : T_{i_q})$. 

The problems we consider in this paper are the following: (1) Given a model R of 
a functional data source (S, F) and a query on the model, is it possible to answer the 
query using the procedures f defined for the functional data source?, and (2) if the 
answer to the previous question is “yes,” is it possible to find an optimal sequence of 
procedures that will answer the query with minimal cost? 

It goes without saying that not all the queries that are possible on the model are also 
possible on the data source. Consider again the library web site; a simple model for this 
data source is composed of a single relation, that we can call “book,” and defined as: 

book(name:string, title:string, publisher:string, year:int). 



¹ The entities that we call data sorts are known in other quarters as “semantic data 
types.” This name, however, entails a considerable epistemological commitment, quite out 
of place for a concept that, all in all, has nothing semantic about it: an author:string 
is as syntactic an entity as any abstract data type, and does not require extravagant 
semantic connotations. 

A query like 

(N, T) : book(N, T, ‘dover’, 1997), 

asking for the author and title of all books published by Dover in 1997 is quite reasonable 
in the model, but there are no procedures on the web site to execute it. 

We will assume, to begin with, that the model of the web site contains a single 
relation. In this case we can also assume, without loss of generality, that the relation 
is defined in the Cartesian product of all the sorts in the functional data source: 
$R(S_1, \ldots, S_n)$. Throughout this paper, we will only consider non-recursive queries. It 
should be clear in the following that recursive queries require a certain extension of our 
method, but not a complete overhaul of it. Also, we will consider conjunctive queries², 
whose general form can be written as: 

$(S_{k_1}, \ldots, S_{k_p}) : R(S_1, \ldots, S_n),\; S_{j_1} = c_1, \ldots, S_{j_q} = c_q,\; \varphi_1(S_{11}, S_{12}), \ldots, \varphi_u(S_{u1}, S_{u2}) \qquad (1)$ 

where $c_1, \ldots, c_q$ are constants, all the S’s come from the sorts of the relation R, and 
the $\varphi_i$’s are comparison operators drawn from a suitable set, say 
$\varphi_i \in \{<, >, =, \neq, \leq, \geq\}$. 

We will for the moment assume that the functional data source provides no mechanism 
for verifying conditions of the type $\varphi_i(S_{i1}, S_{i2})$. The only operations allowed 
are retrieving data by entering values (constants) in a suitable field of a form or 
traversing a link in a web site with a constant as a label (such as the title of a book in 
the library example). Given the query (1) in a data source like this, we would execute it by 
first determining whether the function 
$f : S_{j_1} \times \cdots \times S_{j_q} \to \{S_{k_1} \times \cdots \times S_{k_p} \times S_{11} \times \cdots \times S_{u2}\}$ 
can be computed. If it can, we compute $f(c_1, \ldots, c_q)$ and, for each result returned, 
check whether the conditions $\varphi_i(S_{i1}, S_{i2})$ are verified. 
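
The two-step execution just described (call the function on the constants, then filter on the comparisons) can be sketched as follows. The representation of results as dictionaries keyed by sort name, the triple encoding of comparisons, and the function name run_query are assumptions for illustration; the sketch presupposes that a computable f has already been found.

```python
import operator

def run_query(f, constants, comparisons, outputs):
    """Evaluate a conjunctive query once the function f is available.

    f:           callable taking the query constants and returning an
                 iterable of rows, each a dict mapping sort name -> value
    constants:   the values c_1, ..., c_q passed to f
    comparisons: list of (op, left_sort, right_sort) triples, where op is a
                 two-argument predicate such as operator.lt
    outputs:     names of the sorts to return for each qualifying row
    """
    results = []
    for row in f(*constants):
        # Comparisons are checked locally, after the data has been retrieved.
        if all(op(row[left], row[right]) for op, left, right in comparisons):
            results.append(tuple(row[name] for name in outputs))
    return results
```

Note that the rows returned by f must carry the sorts used in the comparisons as well as the requested outputs, which is why the codomain of f above includes both groups.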

The complicated part of this query schema is the first step: the determination of the 
function f that, given the constants in the query, allows us to obtain the query outputs 
$\{S_{k_1}, \ldots, S_{k_p}\}$, augmented with all the quantities needed for the comparisons. 

3 Query Translation 

Informally, the problem that we consider in this section is the following. We have a 
collection of data sorts $S = \{S_1, \ldots, S_n\}$. Given two data sorts $\alpha$, $\beta$, 
defined as Cartesian products of elements of S 
($\alpha = S_{\alpha_1} \times \cdots \times S_{\alpha_a}$ and 
$\beta = S_{\beta_1} \times \cdots \times S_{\beta_b}$) one can define 
a formal (and unique) correspondence function $f_{\alpha\beta} : \alpha \to \beta$. This 
function operates on the model of the data source (this is why we used the adjective 
“formal” for it: it is not necessarily a function that one can compute) and, given the 
values $\{S_{\alpha_1}, \ldots, S_{\alpha_a}\}$, returns the corresponding values 
$\{S_{\beta_1}, \ldots, S_{\beta_b}\}$. If $\{v_1, \ldots, v_a\}$ are the input values, 
this function computes the relational algebra operation 

$\pi_{N_{\beta_1}, \ldots, N_{\beta_b}} \circ \sigma_{S_{\alpha_1} = v_1, \ldots, S_{\alpha_a} = v_a} \qquad (2)$ 

where the N’s are the names of the sorts S, as per Definition 1. A correspondence 
function can be seen, in other words, as the functional counterpart of the query (2) 
which, on a single table, is completely general. (Remember that we don’t yet consider 
conditions other than the equality with a constant.) 

² Any query can, of course, be translated into a disjunctive normal form, that is, into a 
disjunction of conjunctive queries. The system in this case will simply pose all the 
conjunctive queries and then take the union of all the results. 
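
As an illustration of a correspondence function viewed as a selection followed by a projection, the sketch below builds such a function over an explicitly stored relation. This is only the formal counterpart: in the setting of the paper the relation is a façade and is generally not materialized, so the list-of-dictionaries encoding and the function name are assumptions made for the example.

```python
def correspondence(relation, alpha, beta):
    """Build the formal correspondence function from sorts alpha to sorts beta.

    relation: list of rows, each a dict mapping sort name -> value
    alpha:    names of the input sorts
    beta:     names of the output sorts
    """
    def f(*values):
        # Selection: keep the rows whose alpha components equal the inputs.
        selected = (row for row in relation
                    if all(row[name] == value
                           for name, value in zip(alpha, values)))
        # Projection: keep only the beta components, without duplicates.
        return {tuple(row[name] for name in beta) for row in selected}
    return f
```

For example, correspondence(book, ["name"], ["title"]) is the formal version of Procedure 1, whereas correspondence(book, ["publisher", "year"], ["name", "title"]) is well defined on the model even though the library site offers no procedure that implements it.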

The set $F = \{f_{\alpha\beta}\}$ of all correspondence functions contains the grounding of 
all queries that we might ask on the model. The functional data source, on the other hand, 
has procedures $P_i$, each one of which implements a specific function $f_{\alpha\beta}$. 
The set of all implemented correspondence functions is the subset of F made of the 
functions for which an implementing procedure P exists. Our query implementation problem 
is then, given a query q, with the relative correspondence function f, to find a suitable 
combination of implemented functions that is equal to f. In order to make this statement 
more precise, we need to clarify what we mean by “suitable combination of functions”; that 
is, we need to specify a function algebra. We will limit our algebra to three simple 
operations that create sequences of functions, as shown in Table 1. (We assume, 
pragmatically, that more complex manipulations are done by the procedures $P_i$.) 



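Table 1. Operators of the function algebra.

Operation     Definition                          Description             Typing
$f \circ g$   $(f \circ g)(x) = f(g(x))$          function composition    $f : \alpha \to \beta$, $g : \gamma \to \alpha$, $f \circ g : \gamma \to \beta$
$(f, g)$      $(f, g)(x) = (f(x), g(x))$          cartesian composition   $f : \alpha \to \beta$, $g : \alpha \to \gamma$, $(f, g) : \alpha \to \beta \times \gamma$
$f \times g$  $(f \times g)(x, y) = (f(x), g(y))$ cartesian product       $f : \alpha \to \beta$, $g : \delta \to \gamma$, $f \times g : \alpha \times \delta \to \beta \times \gamma$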
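The three operators of the function algebra can be sketched as higher-order functions. This is only an illustration under the assumption that procedures are ordinary single-valued Python callables; the names `compose`, `pair`, `product`, `inc` and `dbl` are hypothetical and do not come from the paper.

```python
from typing import Any, Callable, Tuple

Fn = Callable[[Any], Any]


def compose(f: Fn, g: Fn) -> Fn:
    """Function composition: (f o g)(x) = f(g(x))."""
    return lambda x: f(g(x))


def pair(f: Fn, g: Fn) -> Fn:
    """Cartesian composition: (f, g)(x) = (f(x), g(x))."""
    return lambda x: (f(x), g(x))


def product(f: Fn, g: Fn) -> Callable[[Any, Any], Tuple[Any, Any]]:
    """Cartesian product: (f x g)(x, y) = (f(x), g(y))."""
    return lambda x, y: (f(x), g(y))


# Two toy "procedures" standing in for source functions (hypothetical).
def inc(x: int) -> int:
    return x + 1


def dbl(x: int) -> int:
    return 2 * x


composed = compose(inc, dbl)(3)     # inc(dbl(3))
paired = pair(inc, dbl)(3)          # (inc(3), dbl(3))
multiplied = product(inc, dbl)(3, 4)  # (inc(3), dbl(4))
```

The typing column of Table 1 corresponds to the argument and result shapes of these callables: `pair` shares one input between both functions, whereas `product` takes one input per function.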



A function $f \in F_P$, that is, a function for which a procedure is defined, and that transforms a data
sort S into a data sort P can be represented as a diagram

$$S \xrightarrow{\;f\;} P \qquad (3)$$

The operators of the function algebra generate diagrams like those in the first and third 
column of Table 2. In order to obtain the individual data types, we introduce the formal 
operator of projection. The projection is “formal” in that it exists only in the diagrams: 
in practice, when we have the data type $P \times Q$ we simply select the portion of it that
we need. The projections don’t correspond to any procedure and their cost is zero. The
dual of the projection operator is the Cartesian product which, given two data of type A
and B, produces from them a datum of type $A \times B$. This is also a formal operator with
zero cost, where the dotted line with the x symbols is there to remind us that we are 
using a Cartesian product operator, and the arrow goes from the type that will appear 
first in the product to the type that will appear second (we will omit the arrow when this 
indication is superfluous). 

The Cartesian product of the functions $S \xrightarrow{f} P$ and $Q \xrightarrow{g} R$ is represented
as

$$S \times Q \xrightarrow{\;f \times g\;} P \times R \qquad (4)$$

With these operations, and the corresponding diagrams, in place, we can arrange
the correspondence functions $f \in F_P$ (the implemented ones) in a diagram, which we call the computation
diagram of a data source.

Definition 4. The computation diagram of a functional data source is a graph $G =
(N, E)$ with nodes labeled by a labeling function $\lambda_n : N \to S$, S being the set of
composite data sorts of the source, and edges labeled by the labeling function $\lambda_e :
E \to F_P$ such that each edge is one of the following:

1. a function edge, such that if the edge is $(n_1, n_2)$, then $\lambda_e((n_1, n_2)) : \lambda_n(n_1) \to
\lambda_n(n_2)$, represented as in (3);

2. a projection edge;

3. a Cartesian product edge.

Let us go back now to our original problem. We have a query and a correspondence
function $S_1 \times \cdots \times S_n \xrightarrow{f} P_1 \times \cdots \times P_m$ that we need to compute, where
$S_1, \ldots, S_n$ are the data sorts for which we give values, and $P_1, \ldots, P_m$ are the results
that we desire. In order to see whether the computation is possible, we adopt the following
strategy: first, we build the computation diagram of the data source; then we add a
node called s to the graph and connect it to $S_1, \ldots, S_n$, as well as a node d, with edges
coming from $P_1, \ldots, P_m$; finally, we check whether a path exists from s to d.

If we are to find an optimal solution to the grounding of a correspondence function 
/, we need to assign a cost to each node of the graph and, in order to do this, we need to 
determine the cost of traversing an edge. The cost functions of the various combinations 
that appear in a computation graph are defined in table 2. 



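Table 2. Cost of the functional operations in terms of graph path.

Operation                                                                  Cost
function edge $A \xrightarrow{f} B$                                        $C(B) = C(A) + c(f)$
Cartesian product edge $A, B \to A \times B$                               $C(A \times B) = C(A) + C(B)$
projection edges $A \times B \to A$, $A \times B \to B$                    $C(A) = C(B) = C(A \times B)$
alternative edges $A \xrightarrow{f_1} D$, $B \xrightarrow{f_2} D$         $C(D) = \min(C(A) + c(f_1), C(B) + c(f_2))$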



The problem of finding the optimal functional expression for a given query can 
therefore be reduced to that of finding the shortest path in a suitable function graph, a 
problem that we will now briefly elucidate. Let G be a function graph, G.V the set of 
its vertices, and G.E the set of its edges. 

For every node $u \in G.V$, let $u.\kappa$ be the distance between u and the source of the
path, $u.\pi$ the predecessor(s) of u in the minimal path, and $u.\nu$ the set of nodes adjacent
to u (accounting for the edge directions).

In addition, a cost function $c : \text{vertex} \times \text{vertex} \to \text{real}$ is defined such that $c(u, v)$ is
the cost of the edge $(u, v)$. If $(u, v) \notin G.E$, then $c(u, v) = \infty$.

The algorithm in Table 3 uses Dijkstra’s shortest path algorithm to build a function
graph that produces a given set of outputs from a given set of inputs, if such a graph
exists: the function dijkstra(G, c, s) returns the set of nodes in G where, for each node
n, $n.\kappa$ is set to the cost of the path from s to n according to the cost function c. Dijkstra’s
algorithm is a standard one and is not reported here.



Table 3. Algorithm for the creation of function graphs.

make_graph(I : {vertex}, O : {vertex}, G : graph, c : vertex × vertex → real) : graph
    s, d : vertex;
    G.V := G.V ∪ {s, d};
    forall u in I do G.E := G.E ∪ {(s, u)}; od;
    forall u in O do G.E := G.E ∪ {(u, d)}; od;
    S := dijkstra(G, c, s);
    Q : graph;
    T, P : {vertex};
    S := S − {s, d}; Q.V := ∅; Q.E := ∅; T := O; P := O;
    while T ≠ ∅ do
        u := element(T);
        Q.V := Q.V ∪ {u};
        if u.π ≠ {s} ∧ u.π ≠ ∅ do
            forall v in u.π do
                if v ∉ P do T := T ∪ {v} fi;
                P := P ∪ {v}; Q.E := Q.E ∪ {(v, u)};
            od;
        fi;
        T := T − {u};
    od;
    return Q;
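The construction of Table 3 can be sketched in Python as follows. This is a simplified sketch, not the authors' implementation: it assumes string node names, keeps a single predecessor per node, treats Cartesian product edges as ordinary weighted edges, and backtracks directly from the reachable outputs instead of adding the node d. The names `dijkstra`, `make_graph` and the sample edges are illustrative.

```python
import heapq
from math import inf


def dijkstra(edges, source):
    """edges maps (u, v) -> cost; returns (distance, predecessor) dictionaries."""
    adjacency = {}
    for (u, v), cost in edges.items():
        adjacency.setdefault(u, []).append((v, cost))
    dist = {source: 0.0}
    pred = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, inf):
            continue  # stale queue entry
        for v, cost in adjacency.get(u, []):
            candidate = d + cost
            if candidate < dist.get(v, inf):
                dist[v] = candidate
                pred[v] = u
                heapq.heappush(heap, (candidate, v))
    return dist, pred


def make_graph(inputs, outputs, edges):
    """Return the edges on the cheapest paths from the inputs to each reachable output."""
    s = "__source__"  # artificial start node, as in Table 3
    graph = dict(edges)
    for u in inputs:
        graph[(s, u)] = 0.0
    dist, pred = dijkstra(graph, s)
    plan = set()
    todo = [o for o in outputs if o in dist]
    seen = set(todo)
    while todo:
        u = todo.pop()
        p = pred.get(u)
        if p is None or p == s:
            continue  # u is an input: nothing to backtrack
        plan.add((p, u))
        if p not in seen:
            seen.add(p)
            todo.append(p)
    return plan


sample_edges = {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 5.0, ("C", "D"): 1.0}
plan = make_graph({"A"}, {"D"}, sample_edges)
no_plan = make_graph({"A"}, {"Z"}, sample_edges)
```

In the sample, the direct edge from A to C costs more than the path through B, so the plan uses the edges A to B, B to C and C to D; an output that cannot be reached yields an empty plan.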



4 Relaxing Some Assumptions 

The model presented so far is a way of solving a well known problem: given a set of 
functions, determine what other functions can be computed using their combination; 
our model is somewhat more satisfying from a modeling point of view because of the 
explicit inclusion of the cartesian product of data sorts and the function algebra opera- 
tors necessary to take them into account but, from an algorithmic point of view, what 
we are doing is still finding the transitive closure of a set of functional dependencies. 
We will now try to ease some of the restrictions on the data source. These extensions, 
in particular the inclusion of joins, can’t be reduced to the transitive closure of a set of 
functional dependencies, and therein lies, from our point of view, the advantage of the 
particular form of our model. 

Comparisons. The first limitation that we want to relax is the assumption that the data
source doesn’t have the possibility of expressing any of the predicates $\phi_i(S_{i_1}, S_{i_2})$ in
the query (1). There are cases in which some limited capability in this sense is available.







Simone Santini and Amarnath Gupta 



We will assume that the following limitations are in place: firstly, the data source
provides a finite number of predicate possibilities; secondly, each predicate is of the form
$\phi(S, R) = S \;\mathrm{op}\; R$, where S and R are fixed data sorts, and “op” is an operator that
can be chosen amongst a finite number of alternatives. The general idea here comes, of
course, from an attempt to model web sites in which conditions can be expressed as part
of “forms.”

In order to incorporate these conditions into our method, one can consider them as
data sorts: each condition $\phi(S, R)$ is a data sort that takes values in the set of triples
(s, r, op), with s of sort S and r of sort R. In other words, indicating a sort as a pair
N : T, where N is the name and T the data type of the sort, a comparison data sort
$\phi(S_1, S_2)$ is isomorphic to $N_1 : T_1 \times N_2 : T_2 \times (T_1 \times T_2 \to 2)$ where 2 is the data
type of the booleans. A procedure that accepts in input a value of a data sort $S_1$, and a
condition on the data sorts $S_2$, $S_3$, would be represented as



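[Diagram (5): the nodes $S_1$ and $\phi(S_2, S_3)$ are joined by a Cartesian product edge into
$S_1 \times \phi(S_2, S_3)$, from which a function edge leads to the result sort.]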



The only difference between condition data sorts and regular data sorts is that conditions 
can’t be obtained as the result of a procedure, so that in a computation graph a condition 
should not have any incoming edge. 

Joins. Let us consider now the case in which the model of the functional data source 
consists of a number of relations. We can assume, for the sake of clarity, that there are 
only two relations in the model: 

$$R_1(N_1 : T_1, \ldots, N_p : T_p)$$

$$R_2(M_1 : Q_1, \ldots, M_v : Q_v) \qquad (6)$$

Each of these relations supports intra-relational queries that can be translated into functions
and executed using the computation graph of that part of the functional source that
deals with the data sorts in the relation. In addition, however, we now have queries that
need to join data between the two relations. Consider the relations $R_1(X_1, X_2, X_3)$,
$R_2(Y_1, Y_2, Y_3)$ and the following query:

(A, B) :- $R_1$(X, ’x’, A), $R_2$(Y, ’y’, B), A = B.     (7)



We can compute this query in two ways. The first makes use of the following two
correspondence functions:

$$X_2 \xrightarrow{\;f_1\;} X_1 \times X_3, \qquad Y_2 \times Y_3 \xrightarrow{\;f_2\;} Y_1 \qquad (8)$$



To implement this query, we adopt the following procedure:

Procedure 3:

i) use the computation graph of $R_1$ to compute $f_1$(’x’), returning a set of
pairs $(a : X_1, b : X_3)$;

ii) for each pair (a, b) returned:

ii.1) compute $f_2$(’y’, b) using the graph of $R_2$, obtaining
a set of results $(c : Y_1)$;

ii.2) for each c, form the pair (a, c), and add it to the output.
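The steps of the procedure amount to a nested loop over the results of the two correspondence functions. The sketch below assumes toy in-memory tables `R1` and `R2` and list-valued functions `f1` and `f2`; all data and names are hypothetical stand-ins for the procedures of a real source.

```python
# Toy relations R1(X1, X2, X3) and R2(Y1, Y2, Y3); the data are made up.
R1 = [(1, "x", 10), (2, "x", 20), (3, "z", 10)]
R2 = [(7, "y", 10), (8, "y", 30), (9, "y", 20)]


def f1(x2):
    """Correspondence function X2 -> X1 x X3."""
    return [(x1, x3) for (x1, v, x3) in R1 if v == x2]


def f2(y2, y3):
    """Correspondence function Y2 x Y3 -> Y1."""
    return [y1 for (y1, v, w) in R2 if v == y2 and w == y3]


def join(first, second, x_const, y_const):
    """Step i) calls `first`; steps ii.1) and ii.2) call `second` once per pair."""
    output = []
    for a, b in first(x_const):
        for c in second(y_const, b):
            output.append((a, c))
    return output


result = join(f1, f2, "x", "y")
```

Note that the second function is invoked once for every pair returned by the first, so the number of source calls grows with the size of the intermediate result.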

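The procedure can be represented using a computation graph in which the graphs
that compute $f_1$ and $f_2$ are used as components. Let us indicate the graph that computes
the function $f$ as:

$$\alpha \longrightarrow \boxed{f} \longrightarrow \beta \qquad (9)$$

Then a join like that in the example is computed by the following diagram:

[Diagram (10): the graphs of $f_1$ and $f_2$ used as components, with the join value produced by $f_1$ supplied as input to $f_2$.]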




The second possibility to compute the join is symmetric. While in this case we used
the relation $R_1$ to produce the variable on which we want to join and the relation $R_2$ to
impose the join condition, we will now do the reverse. We will use the functions

$$Y_2 \xrightarrow{\;f_3\;} Y_1 \times Y_3, \qquad X_2 \times X_3 \xrightarrow{\;f_4\;} X_1 \qquad (11)$$

and a computation diagram similar to the previous one. Checking whether the source
can process the join, therefore, requires checking if either the pair of functions $(f_1, f_2)$
or the pair $(f_3, f_4)$ can be computed. The concept can be easily extended to a source with many
relations and a query with many joins as follows.

Take a conjunctive query, and let $J = \{J_1, \ldots, J_n\}$ be the set of its joins, with $J_i :
(X_i = Y_i)$. We can always rewrite a query so that each variable X will appear in only
one relation, possibly adding some join conditions. Consider, for example, the fragment
R(A, X), P(B, X), Q(C, X), which can be rewritten as

$$R(A, X_1),\; P(B, X_2),\; Q(C, X_3),\; X_1 = X_2,\; X_2 = X_3. \qquad (12)$$
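The normalization step can be sketched as a renaming pass over the atoms of the query. This is a minimal sketch that assumes every argument is a variable given as a string and that the generated names (`X1`, `X2`, ...) do not clash with existing variables; the function name `normalize` and the query encoding are hypothetical.

```python
from collections import defaultdict


def normalize(atoms):
    """atoms: list of (relation, [variables]). Returns (new_atoms, equalities).

    Every variable shared by several atoms is replaced by a fresh copy in
    each atom, and consecutive copies are chained by equality conditions.
    """
    occurrences = defaultdict(int)
    for _, args in atoms:
        for v in set(args):
            occurrences[v] += 1

    counter = defaultdict(int)
    last_copy = {}
    new_atoms, equalities = [], []
    for relation, args in atoms:
        new_args = []
        for v in args:
            if occurrences[v] > 1:
                counter[v] += 1
                fresh = f"{v}{counter[v]}"
                if v in last_copy:
                    equalities.append((last_copy[v], fresh))
                last_copy[v] = fresh
                new_args.append(fresh)
            else:
                new_args.append(v)
        new_atoms.append((relation, new_args))
    return new_atoms, equalities


query = [("R", ["A", "X"]), ("P", ["B", "X"]), ("Q", ["C", "X"])]
normalized_atoms, join_conditions = normalize(query)
```

Applied to the fragment R(A, X), P(B, X), Q(C, X), the sketch yields the atoms and the two equalities of (12).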

We will assume that all queries are normalized in this way. Given a variable X, let s(X)
be the relation in which X appears. Also, given a relation R in the query, let i(R) be the
Cartesian product of its input sorts, and o(R) the Cartesian product of its output sorts.

Table 4. Algorithm for the verification of the join conditions.

check(I : {vertex}, O : {vertex}, G : graph, c : vertex × vertex → real) : boolean
    s, d : vertex;
    G.V := G.V ∪ {s, d};
    forall u in I do G.E := G.E ∪ {(s, u)} od;
    forall u in O do G.E := G.E ∪ {(u, d)} od;
    S := dijkstra(G, c, s);
    return (d.κ < ∞);
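The test of Table 4 can be sketched as a finite-cost reachability check. The sketch assumes string node names and an edge dictionary mapping (u, v) to a cost; as in the table, the node d receives an edge from every output, so the test succeeds as soon as d is reachable from s. The names `check`, `S_NODE` and `D_NODE` are illustrative.

```python
import heapq
from math import inf

S_NODE, D_NODE = "__s__", "__d__"  # artificial source and destination


def check(inputs, outputs, edges):
    """True when d is reachable from s at finite cost."""
    graph = dict(edges)
    graph.update({(S_NODE, u): 0.0 for u in inputs})
    graph.update({(u, D_NODE): 0.0 for u in outputs})

    adjacency = {}
    for (u, v), cost in graph.items():
        adjacency.setdefault(u, []).append((v, cost))

    # Plain Dijkstra; only the distance of d is needed.
    dist = {S_NODE: 0.0}
    heap = [(0.0, S_NODE)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, inf):
            continue
        for v, cost in adjacency.get(u, []):
            if d + cost < dist.get(v, inf):
                dist[v] = d + cost
                heapq.heappush(heap, (d + cost, v))
    return dist.get(D_NODE, inf) < inf


source_edges = {("X2", "X1"): 1.0}  # a single function edge X2 -> X1
forward = check({"X2"}, {"X1"}, source_edges)
backward = check({"X1"}, {"X2"}, source_edges)
```

With one function edge from X2 to X1, the sort X1 can be obtained from X2 but not the other way round.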



The algorithm for query rewriting is composed of two parts. The first is a function
that determines whether a function from a given set of inputs to a given set of outputs can
be implemented; it is reported in Table 4. The second finds a join combination that
satisfies the query. It is assumed that the set of the join conditions that appear in the query,
$J = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, is given. The algorithm, reported in Table 5, returns a
computation graph that computes the query with all the required joins.

Table 5. Join determination algorithm.

joins(J : {vertex × vertex}, G : graph, c : vertex × vertex → real) : graph

1. I := ∅

2. forall (X, Y) in J do
       R := s(X); Q := s(Y);
       if check(i(R), o(R) ∪ {X}, G, c) ∧ check(i(Q) ∪ {Y}, o(Q), G, c) do
           I := I ∪ {(X, Y)}
       elseif check(i(Q), o(Q) ∪ {Y}, G, c) ∧ check(i(R) ∪ {X}, o(R), G, c) do
           I := I ∪ {(Y, X)}
       fi od;

3. Q := make_graph(⋃ᵢ i(Rᵢ) ∪ {X | (X, Y) ∈ I}, ⋃ᵢ o(Rᵢ) ∪ {Y | (X, Y) ∈ I}, G, c);

4. forall (X, Y) in I do Q.E := Q.E ∪ {(X, Y)} od;

5. if cycle(Q) do error fi;

6. return Q;



The correctness of the algorithm is proven in the following proposition: 

Proposition 1. Algorithm 1 succeeds if and only if the query with the required joins 
can be executed. 

The proof can be found in [5]. 

While the algorithm “joins” is an efficient (linear in the number of joins) way of 
finding a plan whose correctness is guaranteed, finding an optimal plan is inherently 
harder: 

Theorem 1. Finding the minimal set of functions that implements all the joins in the 
query is NP-hard. 

Proof. We prove the theorem with a reduction from graph cover. Let G = (V, E) be
a graph, with set of nodes $V = \{v_1, \ldots, v_n\}$, edges $E = \{e_1, \ldots, e_m\}$, and with
$e_i = (v_{i_1}, v_{i_2})$, $v_{i_1}, v_{i_2} \in V$. Given such a graph, we build a functional source and a
query as follows.

For each node $v_i$ define a sort $X_i$ and a function $f_i : I \to X_i$. All the sorts are
of the same data type. For each edge $(v_h, v_k)$ define a condition $X_h = X_k$. Also,
define a function $g : X_1 \times \cdots \times X_n \to Y$. Finally, define the relations $R_1(I, X_1)$,
$R_2(X_1, X_2), \ldots, R_{n+1}(X_n, Y)$ and the query

ans(Y) :- $R_1(I, X_1), R_2(X_1, X_2), \ldots, R_{n+1}(X_n, Y)$,
          I = ’i’, $X_{1_1} = X_{1_2}, \ldots, X_{m_1} = X_{m_2}$     (13)

where the equality conditions are derived from the edges of the graph. The reduction
procedure is clearly polynomial so, in order to prove the theorem, we only need to prove
that a solution of graph cover for G exists if and only if a cost-bound plan can be found
for the query.

1. Suppose that a query plan for the query exists that uses B + 1 functions: $P =
\{f_1, \ldots, f_B, g\}$ (the function g must obviously be part of every plan, since it is the
only function that gives us the required output Y). Consider the set $S = \{v_i \mid f_i \in
P\}$, which contains, clearly, B nodes, and an edge $(v_{i_1}, v_{i_2})$ of the graph. This edge
is associated to a condition $X_{i_1} = X_{i_2}$ in the query and, since the query has been
successfully planned, either the function $f_{i_1}$ or the function $f_{i_2}$ is in the plan. Consequently,
either $v_{i_1}$ or $v_{i_2}$ is in the set, and the edge $(v_{i_1}, v_{i_2})$ is covered.

2. Let now $S = \{v_1, \ldots, v_B\}$ be a covering and consider the plan $P = \{f_i \mid v_i \in S\} \cup
\{g\}$. The output is clearly produced correctly as long as all the join conditions are
satisfied. Let $X_{i_1} = X_{i_2}$ be a join condition. This corresponds to an edge $(v_{i_1}, v_{i_2})$
and, since S is a covering, either $v_{i_1}$ or $v_{i_2}$ is in S. Assume that it is $v_{i_1}$ (if it
is $v_{i_2}$ we can clearly carry out a similar argument). Then the plan contains the
function $f_{i_1}$, which computes $X_{i_1}$ so that the variables $X_{i_1}$ and $X_{i_2}$ and the join
are determined by the following graph fragment

[Graph fragment over the product $X_1 \times \cdots \times X_n$.] □



5 Related Work 

The idea of modeling certain types of functional sources using a relational façade (or
some modification thereof) is, of course, not new. The problem of reconciling the broad
matching possibilities of a relation with the constraints deriving from the source has
been solved in various ways, the most common of which, to the best of our knowledge,
is the use of adornments [6, 7], which also go under the name of binding patterns.
Given a relation $R(X_1, \ldots, X_n)$, a binding pattern is a classification of the variables
$X_1, \ldots, X_n$ into input variables (which must be “bound” when the relation is accessed
in the query, hence the name of the technique), output variables (which must be free
when the relation is accessed), and dyadic variables, which can be indifferently inputs
or outputs. Any query that accesses the relation by assigning values to the input variables
and requiring values for some or all the output variables can be executed on that
relational façade. A relational façade can, of course, have multiple binding patterns. If
the relational façade is used to model an n-ary relation isomorphic to it, for instance, it
allows all the $2^n$ possible bound/free binding patterns on its variables or, equivalently,
all its variables are dyadic. In the following, a binding pattern for any n-ary relation
will be represented as a string $b \in \{i, o, d\}^n$ (where i, o, and d stand for input, output,
and dyadic, respectively, although dyadic variables will not appear in the examples that
follow). Unlike our technique, which determines query feasibility at run time, binding
patterns are determined as part of the model. This difference results in a number of
limitations of binding patterns, some examples of which are given below.
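The binding-pattern test described above can be sketched as a string check. The sketch assumes that an access is described by one boolean per variable (True when the query binds it); the helper names `admissible` and `accessible` are hypothetical and not part of any binding-pattern system cited here.

```python
def admissible(pattern, bound):
    """Check one access against one binding pattern.

    pattern: string over 'i' (input), 'o' (output), 'd' (dyadic).
    bound:   sequence of booleans, True when the variable is bound in the query.
    """
    if len(pattern) != len(bound):
        raise ValueError("pattern and access must have the same arity")
    for kind, is_bound in zip(pattern, bound):
        if kind == "i" and not is_bound:
            return False  # an input variable must be bound
        if kind == "o" and is_bound:
            return False  # an output variable must be free
    return True  # dyadic positions accept either case


def accessible(patterns, bound):
    """A relation is accessible when at least one of its patterns admits the access."""
    return any(admissible(p, bound) for p in patterns)


# A binary relation with the patterns (i, o) and (o, i).
r1_patterns = ["io", "oi"]
first_bound = accessible(r1_patterns, (True, False))
second_bound = accessible(r1_patterns, (False, True))
none_bound = accessible(r1_patterns, (False, False))
```

The check depends only on the pattern strings fixed at modeling time, which is the property the following examples contrast with run-time verification.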

Multiple relations with hidden sorts. Consider a source with five sorts, X, Y, P, W, Q,
and the functional dependencies shown in the following diagram

[Diagram (15): functional dependencies among the sorts X, Y, P, W and Q.]

We want to model this source as a pair of relations: $R_1(X, Y)$ and $R_2(P, Q)$, while
the sort W should not be exported. Considering the two relations and the functions
needed to answer queries on them, we can see that the relation $R_1$ has two binding
patterns: (i, o) and (o, i), while $R_2$ has only (o, i). A query such as “ans(Q) :-
$R_1$(X, y), $R_2$(X, Q)” would be rejected by the binding pattern verification system
because $R_1$ produces a set of X values from the query constant y, but $R_2$ can’t take the
X’s as an input. Mapping the query to a functional diagram, however, produces

[Diagram (16): starting from the constant assignment Y ← y, a path through W leads to Q.]

which is computable. Therefore, the query can be answered using the model presented
in this paper.



Non-binding conditions. Binding patterns are based, as the name suggests, on the idea 
of binding certain variables in a relation, that is, on the idea of assigning them specific 
values. Because of these foundations, binding pattern models are ill-equipped to deal 
with non-binding conditions (that is, essentially, with all conditions except equality and 
membership in a finite set). 

As an example, consider a source with three sorts, A, B, and C, and a function
$A \times B \to C$. In addition, the source has a comparison capability which allows it to
compare B with a fourth sort D and return C’s for a specified value of A such that a
specified condition $\phi(B, D)$ is verified: $A \times \phi(B, D) \to C$. The diagram of this source
is:

[Diagram: function edges $A \times B \to C$ and $A \times \phi(B, D) \to C$.]

Because the condition $\phi(B, D)$ is non-binding, it doesn’t contribute any binding pattern
to the relation R(A, B, C) for which the only binding pattern is, therefore, (i, i, o). A
query such as “ans(C) :- R(a, B, C), B < v” where “<” is one of the operators allowed
for $\phi(B, D)$ is not allowed in the binding pattern model, while it can be executed
with the model presented here.

These examples highlight an important general difference between methods, such 
as binding patterns, that encode the satisfiability of functional constraints in the model, 
and methods such as ours that verify them when a query is executed: the latter class of 
methods can take advantage of rewriting opportunities that arise from the specific form 
of the query, even if they do not apply to a class of queries that can be identified at 
modeling time. 



6 Conclusions 

In this paper, we have considered the modeling of data sources for which a relational
model doesn’t apply, but that can be modeled as a set of functions that, given certain constants
and certain conditions, return a set of “corresponding” values. We were interested in
modeling these sources as relations, and in finding algorithms to translate queries on these
relations into combinations of functions provided by the source.

The simplest form of the model that we have presented here is, mutatis mutandis,
an instance of the problem of finding the closure of a set of functional dependencies
and, in this sense, a rather classic one. The framework introduced in that section, however,
allowed us to extend the formalism to other problems that, either from a modeling
point of view (the inclusion of data sorts representing conditions) or from an algorithmic
one (the inclusion of joins), go beyond the transitive closure problem.

In these conclusions, we would like to propose a further interpretation of the work
presented here, an interpretation that, we believe, is more likely to generate interesting 
developments. The functions defined for the data source can be seen as atomic state- 
ments in a query planning language in which we want to translate our queries, and 
the function algebra that we have defined constitutes the structural statements of this 
planning language. 

The problem that we have is therefore that of “implementing” queries in a language 
of fixed structure, but whose primitives change from source to source. In this frame- 
work, we can start asking questions such as the optimal structure of the language in 
order to manage the variability of the statement while still preserving the possibility of 
easy optimization, or the minimal characteristics of the primitive statements that allow 
the creation of interesting plans. 

Finally, the nature of our method might make static planning (planning done separately
from the execution) impossible, because there is no a priori indication of which
queries will be feasible and which won’t. It is not clear, at this time, whether static
optimal planning is possible for sources with restrictions modeled this way, or if it is
necessary to resort to some form of on-the-fly planning as the query is being executed.

These we regard as promising future directions for our work.

Abstract. Roles are meant to capture dynamic and temporal aspects of real-world
objects. The role concept has been used with many semantic meanings:
dynamic class, aspect, perspective, interface or mode. This paper identifies 
common semantics of different role models found in the literature. Moreover, it 
presents a conceptual modelling pattern for the role concept that includes both 
the static and dynamic aspects of roles. A conceptual modelling pattern is 
aimed at representing a specific structure of knowledge that appears in different 
domains. In particular, we adapt the pattern to UML. The use of this pattern 
eases the definition of roles in conceptual schemas. In addition, we describe the 
design of schemas defined using our pattern in order to implement them in any 
object-oriented language. We also discuss the advantages of our approach over 
previous ones. 



1 Introduction 

Accurate and complete conceptual modelling is an essential premise for a correct 
development of an information system. Reusable conceptual schemas facilitate this 
difficult and time-consuming activity. The use of patterns is a key aspect to increase 
the reusability in all stages of software development. 

A pattern identifies a problem and provides the specification of a generic solution 
to that problem. The definition of patterns in conceptual modelling may be regarded 
in two different ways: conceptual modelling patterns and analysis patterns. 

In this paper, we distinguish between a conceptual modelling pattern that is aimed 
at representing a specific structure of knowledge encountered in different domains 
(for instance the MemberOf relationship), and an analysis pattern that specifies a 
generic and domain-dependent knowledge required to develop an application for 
specific users (for instance a pattern for electronic marketplaces). Authors do not 
always make this distinction. For example, to Fowler, in [9], patterns correspond to 
our conceptual modelling patterns while to Fernandez and Yuan, in [8], patterns cor- 
respond to our definition of analysis patterns. For a further discussion on analysis 
patterns see Teniente in [30]. 

The goal of this paper is to propose a conceptual modelling pattern for roles. A
role is meant to capture dynamic and temporal aspects of real-world objects. Some
dynamic situations from the real world are not well captured by the basic modelling
language constructs alone; an example is a situation in which an entity presents
different properties depending on the context in which it is used.

Although definitions of the role concept abound in the literature of conceptual
modelling [2][4][6][9][14][25][26], no uniform and globally accepted definition is
given. Roles are difficult to represent. They are not merely reified names for the
participants in events. As we show in section 3, neither can they be represented as
subtypes of other entity types, even assuming multiple classification and inheritance.
Rather, roles have their own characteristics that require them to be specified with a
particular language construct in conceptual schemas.

We identify common semantics of role models found in the literature and present a 
pattern that fulfils them. The use of this pattern eases the definition of roles in con- 
ceptual schemas. Moreover, we also discuss the design and the implementation of 
conceptual schemas that use our pattern to facilitate their implementation in object- 
oriented languages. We adapt the pattern to UML [20]. As far as we know, ours is the 
first approach that allows the definition of roles by using the standard UML. 

The rest of this paper is organized as follows: the next section presents the Roles
as entity types Pattern. Section 3 discusses related work and compares it with our
proposal. Finally, conclusions and further work are presented.



2 Roles as entity types Pattern 

In order to describe the role pattern we adopt the template proposed by Geyer-Schulz 
and Hahsler in [12] to describe conceptual modelling patterns (called by the authors 
analysis patterns). They adopt a uniform and consistent format, in contrast to Fowler 
in [9] who uses a very free format for pattern writing. Geyer-Schulz and Hahsler 
stress that adhering to a structure for writing patterns is essential since patterns are 
easier to teach, learn, compare, write and use once the structure has been understood. 

Their template preserves the typical context/problem/forces/solution structure of 
design patterns but adapted for the description of conceptual modelling patterns. The 
template includes the following sections: (1) Pattern Name. (2) Intent: what the pat- 
tern does and what problems it addresses. (3) Motivation: a scenario that illustrates 
the problem and how the pattern contributes to the solution in the specific scenario. 
(4) Forces and Context that should be resolved by the pattern. (5) Solution: descrip- 
tion of all relevant structural and behavioural aspects of the pattern. (6) Conse- 
quences: how the pattern achieves its objectives and the existing trade-off. (7) Design 
and implementation: how the pattern can be realized in the design stage. (8) Known 
uses: examples of the pattern. 

Note that, in the same way design patterns include the outline of possible imple- 
mentations of the pattern [11], our conceptual modelling pattern includes the outline 
of the design of the pattern. 

Following this template, the next sections present the Roles as entity types Pattern.

2.1 Intent 

The intent is the representation of roles that entities play through their life span and
the control of their evolution.

2.2 Motivation

The role concept appears very frequently in conceptual modelling. However, the
possibilities that conceptual modelling languages offer for dealing with roles are very
limited and cover only a small part of the role features (see, for example, what UML
supports in [7] and [28]).

There is no uniform and globally accepted definition of roles. We illustrate here
some of the most relevant ones:

• “Role classes capture the temporal and evolutionary aspects of the real-world objects”,
Dahchour et al. in [6].

• “Roles allow an object to receive and send different messages at different stages
of evolution”, Pernici in [25].

• “Role is a defined behaviour pattern which may be assumed by entities of different
kinds”, Bachman and Daya in [1].

• “Roles are founded; defined in terms of relationship to other things, and lacks of
“semantics rigidity” (something is semantically rigid if its existence is tied to its
class)”, Guarino in [14].

To summarize the above definitions, we could say that roles are useful to model 
the properties and behaviour of entities that evolve over time. The entity type Person 
is an illustrative example. During his or her life, a person may play different roles, for 
example he or she may become a student, an employee, a project manager, and so 
forth. Besides this, a person may have different properties and behaviour depending 
on the role or roles he/she is playing in a certain instant of time. 

For instance, consider the following scenario: let Maria be a person who starts
studying at a university (Maria plays the role of student). After some years of study
she registers for a second university degree (Maria plays the role of student twice) and
starts to work in a company (Maria plays the role of employee). In that company she
may become a project manager (now Maria, through her employee role, plays the role
of project manager). Note that, in this scenario, if we ask for the telephone number of
Maria, the answer is not trivial since, depending on the role she is playing, it may be
her personal or her company phone number.

Taking into account the complexity of the notion of role and the lack of support for 
roles in present conceptual modelling languages, it is clear that a pattern to define 
such a common construct is needed in conceptual modelling. 



2.3 Forces and Context 

Our definition of the role concept is refined by describing the set of features that roles 
must meet, most of which have been identified by Steinmann [27]. In our case, these 
features are the forces that influence and should be resolved by the pattern. 

We describe them using some examples over the scenario introduced above: 

1. Ownership. A role comes with its own properties [16] [6] f 1 5] [3 1 ], i.e., an instance 
of Employee has its own properties which may be different to the ones of the 
entity type that plays such a role. 
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2. Dependency. An instance of a role is related to a unique instance of its entity type 
and its existence depends on the entity type to which it is associated to [16][4][6], 
i.e., it is not possible to have an instance of Student not related to an instance of 
Person. 

3. Diversity. An entity may play different roles simultaneously [16][6][13][25][15] 
[31][32], i.e., an instance of Person may simultaneously play the roles of Student 
and Employee. 

4. Multiplicity. An instance of an entity type may play several instances of the same 
role type at the same time [16][6][13][15][25][31][32]. For instance, a person who 
registers at more than one university has multiple instances of Student related to it. 

5. Dynamicity. An entity may acquire and relinquish roles dynamically [1][16][6] 
[13][15][23], i.e., a person may become a student, after some years become an 
employee, finish his/her studies, become a project manager, start another degree 
and so forth. 

6. Control. The sequence in which roles may be acquired and relinquished can be 
subject to restrictions [6][25][31], e.g., a person may not become an employee 
when he/she is older than 65 years. 

7. Roles can play roles [16][4][6][31][32]. This reflects the fact that an instance of 
Person can play the role of Employee, and an instance of Employee can in turn play 
the role of ProjectManager. 

8. Role identity [31]. Each instance of a role has its own role identifier, which is 
different from that of every instance of the entity with which it is associated. 
This solves the so-called counting problem introduced by Wieringa et al. in [31]. It 
refers to the fact that we need to distinguish the instances of the roles from the 
instances of the entity types that play them. For example, if we count the 
number of people who are students in a university (i.e., every person who is 
registered in at least one program of that university), the total may be less than 
the number of registered students in that university (where a person is 
counted twice if he or she is registered in two programs). 

9. Adoption. Roles do not inherit from their entity types [16][13]. Instead, instances 
of roles have access to some properties of their corresponding entities, e.g., Student 
may adopt the name and address properties of Person but neither the religion nor the 
marital status properties. Therefore, the Student role cannot use the latter two 
properties. 
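The counting problem behind the role identity force can be made concrete with a small Python sketch. It is only an illustration under our own assumptions: the class layout, the `enrol` helper and the `role_id` counter are hypothetical names, not part of the pattern itself.

```python
from dataclasses import dataclass, field
from itertools import count

_role_ids = count(1)  # hypothetical source of role identifiers


@dataclass(eq=False)
class Person:
    name: str
    students: list = field(default_factory=list)  # Student role instances


@dataclass(eq=False)
class Student:
    person: Person   # the unique entity that plays this role (Dependency)
    program: str
    role_id: int = field(default_factory=lambda: next(_role_ids))  # Role identity


def enrol(person, program):
    """Create a new Student role instance for the given person."""
    student = Student(person, program)
    person.students.append(student)
    return student


maria, joan = Person("Maria"), Person("Joan")
registrations = [
    enrol(maria, "Computer Science"),
    enrol(maria, "Mathematics"),  # Multiplicity: the same role played twice
    enrol(joan, "Physics"),
]

# Counting role instances and counting the entities that play them differ.
registered_students = len(registrations)
people_studying = len({id(s.person) for s in registrations})
```

Because each Student instance carries its own identifier, the two counts can be computed independently: three registered students, but only two people studying.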



2.4 Solution 

We divide the solution of our role pattern in two subsections. The first one deals with 
the structural aspects of roles while the second one deals with their evolution. 

2.4.1 Structural Aspects of Roles 

We believe there is no fundamental difference between roles and entity types, since 
roles have their own properties and identity. Therefore, we represent roles as entity 
types with their own attributes, relationships and generalisation/specialisation 
hierarchies. For practical reasons we call role entity types (or simply roles if the context is 
clear) the entity types that represent roles, and natural entity types¹ (or simply entity 
types) the entity types that may play those roles. 

We define the relationship between a role entity type and its natural entity type by 
means of a RoleOf relationship. This special relationship relates a natural entity type 
with a role entity type to indicate that the natural entity type may play the role repre- 
sented by the role entity type. In the relationship we also specify the properties (at- 
tributes and associations) of the natural entity type that are adopted by the role entity 
type. 

Note that, since roles may play other roles, the same entity type may appear as a 
role entity type in a RoleOf relationship and as a natural entity type in a different 
RoleOf relationship. 

Although this representation may be expressed in many conceptual modelling lan- 
guages, in this work, we only adapt it to UML. In particular, we use UML 2.0 [20] 
and OCL 2.0 [19] versions. 

To be able to represent the RoleOf relationship we use the extension mechanisms 
provided by UML, such as stereotypes, tags and constraints. Stereotypes allow us to 
define (virtual) new subclasses of metaclasses by adding some additional semantics. 
A stereotype may also define additional constraints on its base class and add some 
new properties through the use of tags. 

The «RoleOf» stereotype allows us to define a RoleOf relationship between the 
natural and role entity types. The base class of the stereotype is the Association 
metaclass, which represents association relationships among classes. The «RoleOf» 
stereotype also includes the properties² the role adopts from the natural entity type. 
They are represented with a multivalued tag called adoptedProperties. We may package 
this stereotype in a new UML Profile [20] for Roles. Figure 1 shows the definition of 
the «RoleOf» stereotype. 



[Figure 1: UML diagram in which the stereotype «RoleOf», with the tag adoptedProperties[*]: String, extends the metaclass Association.] 





Fig. 1. Definition of the RoleOf stereotype 



The multiplicity of the role towards its entity type is ‘1’ (since a role can only be 
related to a single instance of the entity type) and its settability is readonly (the role 
instance must always be related to the same instance of the entity type). 

¹ The natural entity type of a role relationship has sometimes been called object class [6][31], 
ObjectWithRoles [13], natural type [14][27], base class [4], entity type [1], entity class [2], 
base role [24], or core object [3]. 

² A property in UML 2.0 [20] represents both the attributes and associations of an entity type. 

As an example, figure 2 shows the extended example introduced in section 2.2 
specified in UML. The figure illustrates a natural entity type, Person, with its own 
properties, playing two roles: Student and Employee. The role Student is a generalisation 
of domestic and foreign students. The role Employee may also play the role of 
ProjectManager, who manages a set of tasks. Student adopts the properties name, phone 
number and country (represented as attributes) and address (represented as an association) 
from Person, and Employee adopts the name and the derived age attribute. 
ProjectManager adopts name, employee number and the contract expiration date 
from Employee. 

Note that Employee has its own phone number, different from the Person’s phone 
number, i.e., Employee does not adopt the phone number attribute from Person. 
Therefore the answer to the question “what is the phone number of Maria?” will 
vary depending on whether we are considering Maria as an instance of Person or 
of Employee. The stereotyped operations shown in the figure will be taken up in the 
following section. 



[Figure 2: UML class diagram. Recoverable content: 
Person (name: String, phone#: PhoneNumber, birthDate: Date, country: String, /age: Integer), 
associated with Address (street: String, number: Integer, ZIPcode: String). 
Employee (employee#: Integer, category: String, phone#: PhoneNumber, state: String, 
expirationDate: Date; «IniIC» mayBeHired(), «DelIC» mayBeFired()), 
«RoleOf» Person with {adoptedProperties = name, age}. 
ProjectManager (projectName: String, startDate: Date; «IniIC» notTooManyPendingTasks()), 
«RoleOf» Employee with {adoptedProperties = name, employee#, expirationDate}, 
associated with Task (taskName: String, startDate: Date, dueDate: Date, cost: Integer). 
Student (student#: Integer), «RoleOf» Person with {adoptedProperties = name, address, 
phone#, country}, specialised into DomesticStudent and ForeignStudent. 
University (name: String) also appears in the diagram. 
The natural entity type end of the «RoleOf» associations is marked 1 {readonly}.] 

Fig. 2. Example of RoleOf relationships in UML 



To complete the definition of the static aspects of roles we must attach some con- 
straints to the «RoleOf» stereotype in order to control the correctness of its use. 

The constraints are the following: 

• A stereotyped «RoleOf» association is a binary association with multiplicity 
‘1’ and settability readonly in a member end. 

• Each value of the adoptedProperties tag must coincide with the name of a property 
of the natural entity type. 

• A role entity type can only be related through a RoleOf relationship to at most 
one natural entity type. 

• No cycles of roles are permitted; a role entity type may not be related through 
a direct or indirect RoleOf relationship to itself. 
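The four constraints above can be checked mechanically. The following Python sketch is one possible checker over a deliberately simplified schema description; the `RoleOf` record, the `check_role_of` function and the dictionary-based schema are assumptions made for illustration and do not correspond to any UML tool API. We also assume that properties a type has itself adopted count as properties that may be adopted from it, as ProjectManager adopting name from Employee suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleOf:
    role: str                            # name of the role entity type
    natural: str                         # name of the natural entity type
    adopted: tuple = ()                  # value of the adoptedProperties tag
    natural_end_multiplicity: str = "1"
    natural_end_readonly: bool = True


def check_role_of(own_properties, relationships):
    """Return a list of violation messages (empty if the schema is valid)."""
    errors = []
    adopted_by = {}
    for rel in relationships:
        adopted_by.setdefault(rel.role, set()).update(rel.adopted)

    natural_of = {}
    for rel in relationships:
        # Constraint 1: multiplicity '1' and readonly at the natural end.
        if rel.natural_end_multiplicity != "1" or not rel.natural_end_readonly:
            errors.append(f"{rel.role}: natural end must be '1' and readonly")
        # Constraint 2: adopted names must be properties of the natural type.
        available = own_properties.get(rel.natural, set()) | adopted_by.get(rel.natural, set())
        unknown = sorted(set(rel.adopted) - available)
        if unknown:
            errors.append(f"{rel.role}: unknown adopted properties {unknown}")
        # Constraint 3: at most one natural entity type per role.
        if rel.role in natural_of:
            errors.append(f"{rel.role}: more than one natural entity type")
        else:
            natural_of[rel.role] = rel.natural

    # Constraint 4: following RoleOf links must never lead back to the start.
    for start in natural_of:
        visited = set()
        current = natural_of.get(start)
        while current is not None and current not in visited:
            if current == start:
                errors.append(f"{start}: cycle of RoleOf relationships")
                break
            visited.add(current)
            current = natural_of.get(current)
    return errors


own = {
    "Person": {"name", "phone#", "birthDate", "country", "age", "address"},
    "Employee": {"employee#", "category", "phone#", "state", "expirationDate"},
    "ProjectManager": {"projectName", "startDate"},
    "Student": {"student#"},
}
schema = [
    RoleOf("Employee", "Person", ("name", "age")),
    RoleOf("ProjectManager", "Employee", ("name", "employee#", "expirationDate")),
    RoleOf("Student", "Person", ("name", "address", "phone#", "country")),
]
valid_errors = check_role_of(own, schema)
broken_errors = check_role_of(own, schema + [RoleOf("Person", "ProjectManager", ("religion",))])
```

The schema of the running example passes the check, whereas making Person a role of ProjectManager introduces both a cycle and an unknown adopted property.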

The properties adopted by the role from its natural entity type may be considered 
implicit properties of the role entity type. Nevertheless, in order to facilitate the use of 
these adopted properties (for instance, when writing OCL expressions) we may need 
to include them explicitly in the role entity type. In this case, we add an extra property 
to the role entity type for each adopted property. These extra properties are labeled 
with the «adopted» stereotype to distinguish them from the role entity type's own 
properties. In addition, they are derived. Their derivation rule always 
follows the general form: 

context RoleEntityType::adoptedPropertyX: Type 
derive: naturalEntityType.propertyX 

Note that, to facilitate the work of designers, these added properties can be auto- 
matically generated. Figure 3 extends a subset of the previous example illustrating the 
Student role entity type including its adopted properties. 




Fig. 3. Example of the Student role entity type 
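The derivation rule for adopted properties translates naturally into read-only computed attributes. The sketch below is an illustration in Python rather than part of the pattern; the `adopted` helper and the attribute name `natural_entity` are hypothetical.

```python
def adopted(property_name):
    """Build a read-only derived property that reads from the natural entity.

    Mirrors the rule: derive: naturalEntityType.propertyX
    """
    return property(lambda self: getattr(self.natural_entity, property_name))


class Person:
    def __init__(self, name, country, religion):
        self.name = name
        self.country = country
        self.religion = religion


class Student:
    # Only the adopted properties are exposed; religion is not adopted.
    name = adopted("name")
    country = adopted("country")

    def __init__(self, person, student_number):
        self.natural_entity = person          # the Person playing this role
        self.student_number = student_number  # own property of the role


maria = Person("Maria", "Spain", "unspecified")
student = Student(maria, 4711)
```

Since the value is derived on every access, a change to the name of the Person is immediately visible through the Student role, while properties that were not adopted remain inaccessible from the role.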

2.4.2 Role Acquisition and Relinquishment 

So far, we have introduced a representation of the static part of the Roles as entity 
types Pattern. Nevertheless, this is not enough since role instances may be added or 
removed dynamically from an entity during its lifecycle and this addition or removal 
may be subjected to user-defined restrictions. 

Since roles are represented as entity types we may define constraints on roles in 
the same way as we define constraints on entity types. Some of the constraints are 
inherent to our role representation (for example, that a person must play the role of 
Employee to play the role of ProjectManager, is already enforced by the schema). 
Other restrictions involved may be expressed by means of the predefined constraints 
of UML. For example, to restrict that an Employee cannot play more than twice the 
ProjectManager role simultaneously, it is enough to define a cardinality constraint in 
the relationship. The definition of the rest of constraints requires the use of a general- 
purpose language, commonly OCL in the case of UML. For instance, we could spec- 
ify OCL constraints to control that: 

• A person can only play the role of Employee if he/she is between 18 and 65 years 
old: 

context Employee inv: 
self.age >= 18 and self.age <= 65 

• Any task of a ProjectManager must finish before his/her contract expires: 

context Task inv: 
self.dueDate < self.projectManager.expirationDate 

These OCL constraints are static, and thus the role instances must satisfy them at 
any time. However, many of the restrictions involved in the evolution of 
roles only apply at particular times: they only need to be satisfied when the 
role is acquired or when it is relinquished. To specify such constraints we use the 
notion of creation-time constraints defined by Olivé in [18] and, in a similar way, we 
define deletion-time constraints. 

Creation-time constraints must hold when the instances of some entity type are 
created (in our case when the role is created). Deletion-time constraints must hold 
when the instances of some entity type are deleted (in our case when the role is de- 
leted). These constraints are represented as operations, also called constraint opera- 
tions, attached to the entity types and identified by a special stereotype. The creation- 
time constraint operations are marked with the stereotype «IniIC». We define the 
stereotype «DelIC» for the deletion-time constraint operations. 

These operations return a boolean that must be true to indicate that the constraint is 
satisfied. If the operation returns false (i.e., the constraint is not satisfied), the 
creation or deletion event of the role is rejected. When appropriate, the 
operations are automatically executed by the information system. 

As an example, we have defined the following restrictions in figure 2: 

• A person cannot become an employee if he/she is studying two university degrees 
simultaneously. Note that, since the constraint is only checked when the Employee 
role is created, it does not prevent a person who is already an employee from 
enrolling in two degrees. 

context Employee::mayBeHired(): Boolean 
body: self.person.student->size() < 2 

• An employee may not be fired if he or she is on maternity leave. 

context Employee::mayBeFired(): Boolean 
body: self.state <> 'MaternityLeave' 

• An employee may not become a new project manager if he/she still holds more 
than ten pending tasks. 

context ProjectManager::notTooManyPendingTasks(): Boolean 
body: self.employee.projectManager.tasks-> 
select(dueDate > Today)->size() <= 10 
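The difference between a static invariant and a creation-time or deletion-time constraint can be illustrated with a minimal Python sketch. The `hire` and `fire` helpers, the `degrees` list and the state value "MaternityLeave" are assumptions introduced for the example; only the two guard conditions come from the constraints above.

```python
class Person:
    def __init__(self, name):
        self.name = name
        self.degrees = []    # stands in for the Student role instances
        self.employees = []  # Employee role instances


class Employee:
    def __init__(self, person, state="Active"):
        self.person = person
        self.state = state

    def may_be_hired(self):
        # Creation-time constraint: fewer than two degrees in progress.
        return len(self.person.degrees) < 2

    def may_be_fired(self):
        # Deletion-time constraint (state value is an assumed encoding).
        return self.state != "MaternityLeave"


def hire(person):
    """Create the Employee role only if the creation-time constraint holds."""
    candidate = Employee(person)
    if not candidate.may_be_hired():
        return None
    person.employees.append(candidate)
    return candidate


def fire(employee):
    """Delete the Employee role only if the deletion-time constraint holds."""
    if not employee.may_be_fired():
        return False
    employee.person.employees.remove(employee)
    return True


maria = Person("Maria")
maria.degrees.append("Law")
maria_job = hire(maria)             # allowed: one degree at hiring time
maria.degrees.append("Economics")   # later enrolment is not re-checked

joan = Person("Joan")
joan.degrees.extend(["Physics", "Mathematics"])
joan_job = hire(joan)               # rejected: two degrees at hiring time
```

The guard is evaluated only when the role is created or deleted, so Maria keeps her Employee role after enrolling in a second degree, whereas Joan cannot acquire it.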



2.5 Consequences 

Our pattern of roles achieves the objectives proposed in Section 2.3 since it fulfils the 
role features outlined before: 

• Ownership. As roles are represented as entity types, they may have their own 
properties. 

• Dependency. The cardinality ‘1’ with the tag {readonly} ensures that every role 
instance depends on a unique instance of the natural entity type. 

• Diversity. As the RoleOf relationship is an association, entity types may have 
many RoleOf relationships. 

• Multiplicity. This is obtained by the cardinality of the RoleOf relationship. 

• Dynamicity. Entities are related to their roles through an association. Thus, an 
entity may acquire or relinquish instances of a role many times. 

• Control. The sequence in which roles may be acquired and relinquished can be 
restricted by means of static, creation-time and deletion-time constraints. 

• Roles can play roles. Roles are represented by ordinary classes, so they can be 
participants of a RoleOf relationship. 

• Role identity. As roles are represented as entity types, their instances have their 
own identifier. 

• Adoption. The adoptedProperties tag of the RoleOf relationship allows the 
definition of this mechanism. 

A trade-off of our representation is that we do not consider that 
roles can be associated with different natural entity types. We believe this situation may 
be solved by defining a common supertype for all the natural entity types that play 
such a role. For instance, if we need Client to be a role of both Company and Person 
(understood as a physical person), we could define a common supertype for Company 
and Person, called LegalPerson, which plays the role of Client. 

On the other hand, we do not allow roles to remain unconnected to any entity, as, 
for instance, Employee understood as a vacant position not played by any Person. 
This approach is commonly used when considering roles just as interfaces. We discuss 
the limitations of this approach in Section 3. 

2.6 Design and Implementation 

There are some design patterns useful for designing and implementing roles in 
object-oriented languages [9]. However, most of them are unable to deal completely with 
the role semantics we propose. A well-known pattern close to these semantics 
is the Role Object Pattern [3]. This pattern is especially well suited for role 
implementation when roles are deemed a specialization (or a kind of specialization) of 
their entity type (see Pelechano et al. in [24] as an example). 

Nevertheless, this pattern is not entirely appropriate for designing our conceptual 
modelling pattern. We encounter two main problems in the Role Object Pattern. First, 
it uses a common superclass for all the roles of the entity type. In our approach, roles 
are independent entity types that do not necessarily share any common properties that 
would justify this superclass. Secondly, all the roles are forced to have the same inherited 
properties; it is not possible to define different adopted properties for each role. 

This is the reason why we advocate here for an adapted version of this pattern that 
takes into account our complete role semantics, including the adoption mechanism 
and the creation-time and deletion-time constraints. 

Given a natural entity type and the set of its roles, we create a class for the natural 
entity type and a class for each role. We create a different relationship between the 
natural entity type and each of its roles. This relationship will be used to navigate 
from the natural entity type to its roles and vice versa. We add to the natural entity 
type two new operations addRole and deleteRole in charge of adding (deleting) roles 
to the natural entity after checking the creation-time (deletion-time) constraints. We 
could also add other useful operations when dealing with roles, such as hasRole (to 
check whether an entity plays a role) or getRole (to obtain a role played by the entity). 
The problem of the design of the adopted properties may be regarded as the same 
problem as designing derived information. In general, from a design and/or 
implementation point of view, there are two different approaches to deal with derived 
information. The attributes may be computed, if they are calculated by means of an 
operation, or materialized, if they are explicitly stored in the class. Here we take the 
computed approach: for each adopted property we add an extra operation to the role class 
that returns the value of the property of the natural entity type. The operation accesses 
the property of the natural entity type by navigating through the relationship. 
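The design described above can be sketched in Python as follows. This is a minimal illustration under our own assumptions, not the implementation referred to in the paper: the `PlaysRoles` mixin is merely a convenience for sharing the role-management operations, the method names `may_be_created` and `may_be_deleted` stand in for the creation-time and deletion-time constraint operations, and the age limit is taken from the example given for the Control force.

```python
class PlaysRoles:
    """Mixin for natural entity types (a convenience assumed for this sketch)."""

    def __init__(self):
        self._roles = {}  # one collection per role class, i.e. per relationship

    def add_role(self, role_class, **own_properties):
        role = role_class(self, **own_properties)
        if not role.may_be_created():      # creation-time constraints
            raise ValueError(f"cannot acquire role {role_class.__name__}")
        self._roles.setdefault(role_class, []).append(role)
        return role

    def delete_role(self, role):
        if not role.may_be_deleted():      # deletion-time constraints
            raise ValueError(f"cannot relinquish role {type(role).__name__}")
        self._roles[type(role)].remove(role)

    def has_role(self, role_class):
        return bool(self._roles.get(role_class))

    def get_roles(self, role_class):
        return list(self._roles.get(role_class, []))


class Person(PlaysRoles):
    def __init__(self, name, age):
        super().__init__()
        self.name, self.age = name, age


class Employee(PlaysRoles):
    """A role of Person and, at the same time, a natural type for ProjectManager."""

    def __init__(self, person, employee_number):
        super().__init__()
        self._person = person              # reference to the natural entity
        self.employee_number = employee_number

    def name(self):                        # computed adopted property
        return self._person.name

    def age(self):                         # computed adopted property
        return self._person.age

    def may_be_created(self):
        return self._person.age <= 65

    def may_be_deleted(self):
        return True


class ProjectManager:
    def __init__(self, employee, project_name):
        self._employee = employee
        self.project_name = project_name

    def name(self):                        # adopted from Employee
        return self._employee.name()

    def may_be_created(self):
        return True

    def may_be_deleted(self):
        return True


maria = Person("Maria", 30)
employee = maria.add_role(Employee, employee_number=7)
manager = employee.add_role(ProjectManager, project_name="ER")
```

Note that the role classes share no common superclass: each role only needs to offer the two constraint operations, which keeps the roles independent of one another.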

Figure 4 summarizes our proposal. In figure 5 we apply the proposed design pattern 
to a part of the conceptual schema of figure 2. Note that Employee is both a role 
for the Person entity type and a natural entity type for the ProjectManager role, and 
thus it presents both a reference to Person (as a role entity type) and the operations 
addRole and deleteRole (as a natural entity type). Additionally, Employee also includes 
the name and age operations to get this information from Person. 
Fig. 4. Summarized class diagram of the design 



This structure can be directly implemented in any common object-oriented language. 
An example of the implementation in the Java language can be found in [4]. 




Fig. 5. Example of an application of the design 
2.7 Known Uses 

The role concept appears frequently in many different domains of the real world, 
since in each domain we can find entity types that present some properties that evolve 
over time. 

Papazoglou et al. in [23] note that roles can be useful for several types of applica- 
tions based on the use of object-oriented technology and they describe two examples 
of broad types of application that need role support: security and workflows. Some 
more examples are discussed by Jodlowski et al. in [15]. 

3 Related Work 

Previous research can be grouped in four basic approaches to represent roles. We 
discuss the major drawbacks of each approach according to our role defined seman- 
tics. However, they may suffice when considering more limited semantics. 

The first approach represents a role as a label assigned to a participant in an event 
[20]. This representation does not achieve our objectives because roles come with 
their own properties different from those of the entity types playing them, which 
cannot be defined within the label. 

A second approach considers that roles and entity types can be combined into a 
single hierarchy [1][4][26]. Role entity types are represented as subtypes of the natural 
entity type. For instance, if Person were a natural entity type, then the Student, Employee 
and ProjectManager roles would appear as subtypes. Quite obviously, such a 
solution requires dynamic and multiple classification, since a person can change 
his/her role and play several roles simultaneously. However, we would like to 
emphasise three important features that specialization does not cover. The first is what we 
have defined as multiplicity: an entity may play the same role more than once at the 
same time (e.g., specialization does not allow us to define a Person simultaneously 
playing the role of Employee twice). The second is adoption: with specialization we 
cannot restrict which attributes are adopted by the roles because they inherit all the 
attributes of their supertype. Finally, with specialization the role and the entity 
type have the same identifier, and therefore the counting problem mentioned before is not 
solved. A further discussion on this topic can be found in [27] and [15]. 

A third approach suggests that roles are only partial specifications of the entities 
playing them, so that the features of roles are the very features of interfaces (interfaces 
as types, in the sense of Java and UML), as proposed by Steimann in [29] and by Lea and 
Marlowe in [17]. This alternative does not support the multiplicity feature, since an entity 
may not play the same interface more than once at the same time. Moreover, 
roles do not have their own separate state (the whole state is shared in the natural 
entity type), since interfaces do not have their own attributes. Besides, when an instance 
of the natural entity type is created it automatically acquires all the roles, and 
thus we can neither control nor restrict the evolution of the roles that the instance of the 
natural entity type plays. Therefore, we consider that interfaces do not cover everything 
one might expect from the role concept. 

The last approach, which is also ours, represents a role as an element distinct 
from an entity type but coupled to it [6][16][22][25][27]. However, most of these 
approaches use semantics different from the ones presented in this paper. For instance, 
some solutions are based on the fact that the instance of a natural entity type and its 
role instances share the same object identifier, as in Papazoglou et al. [22], among 
others. These solutions do not solve the counting problem mentioned before either. Others, 
such as Pernici [25], do not allow roles to play roles. Our alternative, which represents 
roles as separate entity types, fulfils the role semantics. 

We believe one of the main advantages of our approach over previous ones is that 
we handle the complexity of role semantics in a very simple manner, since we represent 
roles and their evolution with already existing elements (entity types and constraints) 
without adding completely new language constructs. Therefore, the designer 
can easily use the pattern to specify roles in conceptual schemas. In addition to this, 
our pattern describes a representation of roles in the standard UML, and thus the 
pattern can be directly incorporated into current UML CASE tools. 

We would also like to remark that our approach is complete and feasible in the 
sense that it includes the design and the implementation of the pattern, in contrast to 
most previous approaches, which do not state how this could be achieved. 

4 Conclusions and Further Work 

This paper identifies the most important features of roles and presents the Roles as 
entity types pattern, a conceptual modelling pattern for roles. We have adapted the 
pattern to allow the specification of roles in UML conceptual schemas. To our knowl- 
edge, ours is the first standard extension to UML to define roles in conceptual sche- 
mas in this language. The pattern can be easily implemented in any UML CASE tool 
in order to allow designers to use the role concept. 

The pattern includes the static aspects of roles as well as their evolution. We define 
roles as entity types (role entity types) related to natural entity types by means of a 
RoleOf relationship that includes the adoption of properties from the natural entity 
types by the role entity types. We have extended UML by means of the «RoleOf» 
stereotype to be able to represent such kind of relationships. To specify the role evo- 
lution we use two special kinds of constraints: creation-time constraints and deletion- 
time constraints. We have also discussed the design and implementation of concep- 
tual schemas specified using the pattern. 

It would be interesting to study which taxonomies appearing in conceptual schemas 
would be better specified by using RoleOf relationships. This could be done by 
comparing the specification of the same case study with and without the use of roles. 
Moreover, we would like to automate our approach by means of an application that, 
given a conceptual schema (for instance, represented in XMI [21]), would automatically 
generate the corresponding classes in the target object-oriented language. These 
are directions in which we plan to continue our work. 
Abstract. Our goal is to model the way people induce knowledge from 
rare and sparse data. This paper describes a theoretical framework for 
inducing knowledge from these incomplete data described with concep- 
tual graphs. The induction engine is based on a non-supervised algo- 
rithm named default clustering which uses the concept of stereotype and 
the new notion of default subsumption, the latter being inspired by the 
default logic theory. A validation using artificial data sets and an appli- 
cation concerning an historic case are given at the end of the paper. 



1 Introduction 

We aim to model the way it is possible to induce knowledge from rare and 
sparse data using default reasoning. To this end, we propose a new induction 
engine using non-supervised learning techniques and the conceptual graph formalism 
as described by J. Sowa [1]. The induction mechanism is based on the 
notion of default subsumption, the latter having been inspired by the default 
logic theory of R. Reiter [2]. This new model has been designed both to deal 
with heterogeneous and incomplete databases and to understand the way people 
build stereotypes from incomplete information, as can be found for example 
in newspaper articles. 

On the one hand, such databases have to be automatically completed with 
default reasoning to become comparable. On the other hand, our hypothesis 
is that popular inductions are not only due to the lack of facts, but also to the 
poor description of the existing facts. This sparseness is particularly favorable 
to the use of background knowledge (such as theories) and to the elaboration of 
caricatural representations that we call stereotypes. 

In such cases, we simulate the construction of categories from incomplete information 
by using machine learning and data mining techniques. These categories are 
formed with a new relation we introduce, the default subsumption, and named 
thanks to the concept of stereotype, which is defined below. In the past, some 
meaningful results have been obtained by using supervised learning techniques 
and applying them to model pre-scientific reasoning both in the field of medicine 
and in some cases of dissemination of social misrepresentations [3,4]. 

Our paper is divided in two parts. The first section presents the logical framework 
modeling inductive reasoning from sparse descriptions. This framework 
makes use both of the notion of default subsumption, which is analogous to 
default logic, and of the concept of stereotype, which models the way sparse 
descriptions may be categorized. Tools and strategies are then detailed 
in order to build such sets of stereotypes. The second section is dedicated to the 
validation of the model on artificial data sets and to a real application dealing 
with social misrepresentations. 

2 Logical Framework 

2.1 Default Logic 

During the eighties, there were many attempts to model deductive reasoning 
in the presence of implicit information. Many formalisms [5,6,2] were developed 
to cope with the inherent difficulties of such models, especially their 
non-monotonicity: closed-world assumption, circumscription, default logic, etc. Since 
our goal here is to model the way people induce empirical knowledge from partially 
and non-homogeneously described facts, we face a very similar problem: 
in both cases, one has to reason in the presence of implicit information. Therefore, it is 
natural to make use of similar formalisms. 

In this case, we choose the default logic formalism, which was developed 
in the eighties by R. Reiter [2]. This logic for default reasoning is based on the 
notion of default rules, which make it possible to infer new formulas when the hypotheses 
are not inconsistent. More generally, a default rule always has the following form: 
$A : B_1, B_2, \ldots, B_n / C$, where $A$ is called the prerequisite, the $B_i$ the justifications and 
$C$ the conclusion. This default rule can be interpreted as follows: if $A$ is known 
to be true and if it is consistent to assume $B_1, B_2, \ldots, B_n$, then conclude $C$. 

For instance, let us consider the next default rule: 

$politician(X) \wedge introducedAbroad(X) : \neg diplomat(X) \,/\, traitor(X)$ 

This rule translates a usual way of reasoning for people living in France 
during the end of the 19th century; it means that one can suspect all politicians 
who are introduced abroad to be traitors towards their own countries, except 
for diplomats. In other words, it expresses that the conclusion traitor(X) can 
be derived if X is a politician who is known to be introduced abroad while we 
cannot prove that he is a diplomat. 
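The way such a rule fires can be sketched in a few lines of Python. This is a strongly simplified illustration and not Reiter's full default logic: facts are ground literals written as strings, negation is a leading "~", and "consistent to assume B" is approximated by "the negation of B is not among the known facts". The function name `apply_default` and the literal encoding are our own assumptions.

```python
def negate(literal):
    """Return the complementary literal ('~p' for 'p' and vice versa)."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def apply_default(known, prerequisite, justifications, conclusion):
    """Apply one default rule A : B1, ..., Bn / C to a set of known literals.

    Simplification: a justification is treated as consistent when its
    negation is not explicitly known; no further inference is performed.
    """
    if not prerequisite <= known:
        return set(known)                  # prerequisite A is not established
    if any(negate(b) in known for b in justifications):
        return set(known)                  # some justification is contradicted
    return set(known) | {conclusion}


rule = dict(
    prerequisite={"politician(x)", "introducedAbroad(x)"},
    justifications={"~diplomat(x)"},
    conclusion="traitor(x)",
)

suspect = apply_default({"politician(x)", "introducedAbroad(x)"}, **rule)
envoy = apply_default({"politician(x)", "introducedAbroad(x)", "diplomat(x)"}, **rule)
```

In the first case nothing contradicts the justification, so the conclusion traitor(x) is added; in the second, knowing diplomat(x) blocks the rule and the set of facts is left unchanged.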

Let us note that information conveyed by default rules refers to implicit connotations. 
For example, the antinomy between patriots and internationalists or the 
rule that assimilates almost all the politicians involved with foreigners to traitors 
correspond to connotations and may facilitate the completion of partial descriptions. 
The key idea is that people have in mind stereotypes that correspond to 
strong images stored in memory and that partial descriptions evoke such stereotypes. 
The following sections are dedicated to this concept of stereotype; before that, 
it is necessary to introduce the notion of default subsumption. 

In the rest of this section, we use the framework designed by Sowa in [1]. A 
short introduction can be found in [7]. 
2.2 Default Subsumption 

First of all, let us assume that a stereotype is a specific description, which will 
be in this paper a conceptual graph, and consider the description function $\delta : 
F \to D$ which associates a conceptual graph $\delta(f) \in D$ to each fact $f$ from the 
set of initial facts $F$. Let us next consider that a fact subsumes another fact if 
it results from applying the generalization operators to that other fact. 

A stereotype stored in the memory is said to subsume a fact by default if it 
has no features that contradict the description of the fact considered, i.e. $\delta(f)$. 
So it can be used to complete this description without adding any incoherence. 
In other words, the fact can be completed in such a way that its description can 
now be subsumed by the stereotype. 

Let us consider now the graph $g$ associated with a fact in which there is a very 
large number of missing data. The missing data can be guessed and completed 
to obtain a more specific graph $g_s$. We follow here the notations given by Sowa 
in [1] (definition 3.5.1): $g_s \leq g$ means that $g_s$ is a specialization of $g$ and $g$ is a 
generalization of $g_s$, i.e. $g$ subsumes $g_s$. Now, let $s$ be one stereotype belonging 
to the structured memory. If this stereotype is more general than $g_s$, i.e. $g_s \leq s$, 
then it subsumes $g$ by default. More formally: 

Definition 1. Let $f$ be a fact represented by the conceptual graph $g = \delta(f)$ and 
$s$ a stereotype. $s$ subsumes $g$ by default if and only if there exists a graph $g_s$ 
with $g_s \leq g$ and $g_s \leq s$. $g_s$ is therefore a graph formed by the join operator 
performed on the graphs $g$ and $s$. 

Fig. 1 presents the fact "The meal of Jules is composed of steak, red wine, 
and ends with a cup of very hot coffee", which can be subsumed by default by 
the stereotype "The meal is composed of steak with potatoes and French bread, 
and ends with a cup of coffee" because the fact can be completed to "The meal of 
Jules is composed of steak with potatoes and French bread, red wine, and ends 
with a cup of very hot coffee". If the stereotype had presented a meal ending with 
a liqueur, it would not match the fact and so could not subsume it by default. 
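The meal example can be reproduced with a toy Python sketch. It relies on a strong simplification that is ours and not the paper's: descriptions are flattened into attribute-value dictionaries instead of conceptual graphs, two descriptions contradict each other when they give different values to the same attribute, and the join is the merge of the two dictionaries. The attribute names are hypothetical.

```python
def join(fact, stereotype):
    """Merge two descriptions, or return None if they contradict each other."""
    merged = dict(fact)
    for attribute, value in stereotype.items():
        # setdefault keeps an existing value and fills in a missing one.
        if merged.setdefault(attribute, value) != value:
            return None
    return merged


def subsumes_by_default(stereotype, fact):
    """True if the fact can be completed consistently with the stereotype."""
    return join(fact, stereotype) is not None


fact = {
    "eater": "Jules",
    "main course": "steak",
    "drink": "red wine",
    "ending": "coffee",
    "coffee temperature": "very hot",
}
stereotype = {
    "main course": "steak",
    "side dish": "potatoes",
    "bread": "French bread",
    "ending": "coffee",
}
liqueur_stereotype = dict(stereotype, ending="liqueur")

completed_fact = join(fact, stereotype)
```

The completed fact contains every feature of both descriptions, while the stereotype ending with a liqueur is rejected because it contradicts the ending of the fact.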

Property 1. The notion of default subsumption is more general than that of 
classic subsumption. Let g and g' be two conceptual graphs. If g subsumes g' 
then g subsumes g' by default. 

Property 2. The default subsumption is a symmetrical relation. Let $u_1$ and $u_2$
be two conceptual graphs. If $u_1$ subsumes $u_2$ by default, then $u_2$ subsumes $u_1$
by default too.

Let us note that the notion of default subsumption may appear strange to
people accustomed to classical subsumption since it is symmetrical. As a
consequence, it does not define an ordering relationship on the description space.

2.3 Concept of Stereotype 

Eleanor Rosch saw categorization itself as one of the most important issues
in cognitive science [8]. She observed that children learn how to classify first
in terms of concrete cases rather than through defining features. She therefore
introduced the concept of prototype as the ideal member of a category.
Membership in a class is then defined by the proximity to the prototype and not
by the number of shared features.

Fig. 1. The stereotype subsumes by default the fact description. The description below
is the result of the join operator, i.e. the completed fact. [Panels: Fact, Stereotype,
Completed fact.]

For example, a robin is closer to the bird prototype than an ostrich, but they 
are both closer to it than they are to the prototype of a fish, so we call them both 
birds. However, it takes longer to say an ostrich is a bird than it takes to say a 
robin is a bird because the ostrich is further from the prototype. Sowa defines a 
prototype as a typical instance formed by joining one or more schemata. Instead 
of describing a specific individual, it describes a typical or “average” individual. 

From a computational point of view the concept of prototype is difficult to
manage since many complete observations have to be considered in order to
construct such an ideal fact. Furthermore, it is not really appropriate for
classifying new observations and predicting missing information. We therefore
propose to adopt the concept of stereotype, which is quite close to that of
prototype but better suited to missing values.

The concept of stereotype was introduced by W. Lippmann in a book about
public opinion [9]. We define it here as a specific and imaginary fact that
combines features found in the facts it subsumes by default. Since there is no
contradiction between a fact and its related stereotype, the stereotype may be
used to complete the description of the fact. In other words, a stereotype $s$ is
said to cover a fact described by $g$ if and only if $s$ subsumes by default the
graph $g$. As a consequence, $g$ may be completed by the conceptual graph $s$.
This point will be detailed later.

2.4 Set of Stereotypes 

Groups of people share implicit knowledge, which makes them able to understand 
each other without having to express everything explicitly. This sort of knowledge 
can be expressed in terms of erudite theories (e.g. the “blocking perspiration” 
theory in [3]) or use a more “naive” formulation. Our second hypothesis is that 
this implicit knowledge can be stored in terms of sets of stereotypes. This means 
that many people have in mind the sets of stereotypes and that they use them 
to reason in a stereotyped way by associating new facts to stereotypes they have 
in mind. 

To formalize this idea, let us first suppose that a description space $D$ and a
set of facts $F$ are given. Then, a measure of dissimilarity $M_D$ is defined on $D$.
Previous work deals with graph matching, and an interesting method to calculate
the similarity between two conceptual graphs is proposed in [10]. However, in the
present context we consider a simpler measure, which adds up the differences
between graphs, to be sufficient.

First let us recall the definition of compatibility given in [1]: 

Definition 2. Let conceptual graphs $u_1$ and $u_2$ have a common generalization
$v$ with projections $\pi_1 : v \to u_1$ and $\pi_2 : v \to u_2$. The two projections are
said to be compatible if for each concept $c$ in $v$, the following conditions are true:

1. $type(\pi_1 c) \cap type(\pi_2 c) > \bot$.

2. The referents of $\pi_1 c$ and $\pi_2 c$ conform to $type(\pi_1 c) \cap type(\pi_2 c)$.

3. If $referent(\pi_1 c)$ is the individual marker $i$, then $referent(\pi_2 c)$ is either $i$
   or $*$.

We now consider that there is always only one least common generalization, 
i.e. only two projections that are compatible and maximally extended. It is easy 
to generalize our model with graphs having several least common generalizations. 

The following theorem is stated in order to link the notions of compatibility 
and default subsumption: 

Theorem 1. Let conceptual graphs $u_1$ and $u_2$ have the least common
generalization $v$ with projections $\pi_1 : v \to u_1$ and $\pi_2 : v \to u_2$. $\pi_1$ and $\pi_2$ are
compatible if and only if $u_1$ subsumes $u_2$ by default.

Proof. If $\pi_1$ and $\pi_2$ are compatible then there exists a common specialization $w$
of $u_1$ and $u_2$ (cf. theorem 3.5.7 in [1]). According to definition 1, $u_1$ subsumes $u_2$
by default. Reciprocally, if $u_1$ subsumes $u_2$ by default then there exists a common
specialization $w$. Suppose that $\pi_1$ and $\pi_2$ are not compatible. There therefore
exists at least one concept $c$ in $v$ with $type(\pi_1 c) \cap type(\pi_2 c) = \bot$, or with the
referent of $\pi_1 c$ or $\pi_2 c$ not conforming to $type(\pi_1 c) \cap type(\pi_2 c)$, or with
$referent(\pi_1 c) = i$ and $referent(\pi_2 c) = j$, $i \neq j$. These three cases are absurd
because they contradict the construction of $w$. Therefore, $\pi_1$ and $\pi_2$ are compatible.

Consider now the measure $M_D$ counting the dissimilarities between two
graphs $u_1$ and $u_2$. Let $v$ be the least common generalization graph with
projections $\pi_1 : v \to u_1$ and $\pi_2 : v \to u_2$. If $\pi_1$ and $\pi_2$ are not compatible then
the measure $M_D(u_1, u_2)$ is fixed by convention at an infinite value denoted $M_\infty$
because one graph cannot be subsumed by default by the other (cf. theorem 1).
Otherwise $M_D(u_1, u_2)$ counts all the differences between the concepts
and relations of $u_1$ and those of $u_2$. The measure is thus defined:

Definition 3. Let conceptual graphs $u_1$ and $u_2$ have the least common
generalization $v$ with projections $\pi_1 : v \to u_1$ and $\pi_2 : v \to u_2$. The measure of
dissimilarities $M_D(u_1, u_2)$ is equal to:

1. $M_\infty$ if $\pi_1$ and $\pi_2$ are not compatible.

2. $C + T(u_1) + T(u_2)$ otherwise, where:

   • $C = |\{\text{concept } c \in v \;/\; type(\pi_1 c) \neq type(\pi_2 c) \text{ or } referent(\pi_1 c) \neq referent(\pi_2 c)\}|$.

   • $T(u) = card(u) - card(v)$; $card(g)$ corresponds to the number of nodes
     (concepts and relations) of graph $g$.

This measure presents the following properties: 

Property 3. For any conceptual graph $u$, $M_D(u, u) = 0$.

Property 4. For any conceptual graphs $u$ and $v$, $M_D(u, v) = M_D(v, u)$.
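
To make the measure concrete, here is a minimal Python sketch, again under the assumption that descriptions are flat attribute/value dictionaries instead of conceptual graphs. In that simplified setting the least common generalization is the set of shared attributes, the term C vanishes whenever the two descriptions are compatible, and each T term counts the attributes that are not shared. The function name `dissimilarity` is hypothetical, and `math.inf` plays the role of the infinite value.

```python
import math

def dissimilarity(u1: dict, u2: dict) -> float:
    """Attribute/value analogue of M_D: infinite on contradiction,
    otherwise the number of descriptors present on one side only."""
    shared = u1.keys() & u2.keys()
    if any(u1[a] != u2[a] for a in shared):
        return math.inf  # incompatible: no default subsumption
    # T(u1) + T(u2), where card(v) is the number of shared attributes.
    return (len(u1) - len(shared)) + (len(u2) - len(shared))

fact = {"owner": "Jules", "main": "steak", "drink": "red wine",
        "ending": "coffee", "ending_temperature": "very hot"}
stereotype = {"main": "steak", "side": "potatoes",
              "bread": "French bread", "ending": "coffee"}
liqueur_stereotype = {**stereotype, "ending": "liqueur"}

print(dissimilarity(fact, stereotype))          # 3 + 2 descriptors differ
print(dissimilarity(fact, liqueur_stereotype))  # contradiction on "ending"
```

Both properties stated above are easy to check on this sketch: a description is at distance 0 from itself, and the two arguments can be swapped freely.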

Let us now define what is a set of stereotypes: 

Definition 4. In the framework of conceptual graphs, a set of stereotypes is
a tuple of $n$ graphs $(s_1, s_2, \ldots, s_n)$.

2.5 Completion of Facts 

Given a set of facts $F$ and a set of stereotypes $(s_1, s_2, \ldots, s_n)$, it is possible
to complete the descriptions of almost any fact $f$. More precisely, the completion
is possible when there exists at least one stereotype $s_i$ belonging to the set of
stereotypes $(s_1, s_2, \ldots, s_n)$ such that $s_i$ subsumes by default the description
$\delta(f)$ of the fact $f$. In other words, thinking by stereotypes is possible when new
descriptions are so sparse that they seem consistent with existing stereotypes.
This capacity to classify and to complete the descriptions is characteristic of the
concept of stereotype as introduced by Lippmann.

When one and only one stereotype $s_i$ covers by default the fact $f$, the
description of $f$, $\delta(f)$, may be completed by the stereotype $s_i$. However, it
happens that facts may be covered by two or more stereotypes $s_i$. Then, the
stereotype associated with a fact $f$ is the one that minimizes the measure of
dissimilarity $M_D$, i.e. it is the stereotype $s_i$ which both covers $\delta(f)$ by default
and minimizes $M_D(\delta(f), s_i)$. It is called the relative cover of $f$, with respect to
the measure of dissimilarity $M_D$ and to a set of stereotypes $S = (s_1, s_2, \ldots, s_n)$.

Definition 5. In a more formal way, the relative cover of a fact $f$, with respect
to a set of stereotypes $S = (s_1, s_2, \ldots, s_n)$, denoted $C_S(f)$, is the stereotype $s_i$
if and only if:

1. $s_i \in (s_1, s_2, \ldots, s_n)$,

2. $M_D(\delta(f), s_i) \neq M_\infty$,

3. $\forall k \in [1, n],\ k \neq i,\ M_D(\delta(f), s_i) < M_D(\delta(f), s_k)$.

It may also happen that no stereotype covers the new fact $f$, which means
that $\delta(f)$, the description of $f$, is inconsistent with all $s_i$. In this case, no
completion is possible and thinking by stereotype is impossible.
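
The relative cover and the completion step can be sketched as follows, assuming flat attribute/value dictionaries instead of conceptual graphs. The helper names (`dissimilarity`, `relative_cover`, `complete`) and the sample stereotypes are hypothetical. The strict inequality of condition 3 is read literally, so a tie between two closest stereotypes yields no relative cover in this sketch.

```python
import math
from typing import Optional

def dissimilarity(u1: dict, u2: dict) -> float:
    """Infinite on contradiction, else the count of unshared descriptors."""
    shared = u1.keys() & u2.keys()
    if any(u1[a] != u2[a] for a in shared):
        return math.inf
    return len(u1) + len(u2) - 2 * len(shared)

def relative_cover(fact: dict, stereotypes: list) -> Optional[int]:
    """Index of the unique closest compatible stereotype, or None."""
    scores = [dissimilarity(fact, s) for s in stereotypes]
    best = min(scores, default=math.inf)
    if math.isinf(best) or scores.count(best) > 1:
        return None  # no compatible stereotype, or no strict minimum
    return scores.index(best)

def complete(fact: dict, stereotypes: list) -> dict:
    """Fill the missing attributes of the fact from its relative cover."""
    i = relative_cover(fact, stereotypes)
    if i is None:
        return dict(fact)  # thinking by stereotype is impossible
    return {**stereotypes[i], **fact}

stereotypes = [
    {"main": "steak", "side": "potatoes", "ending": "coffee"},
    {"main": "fish", "drink": "white wine"},
    {"main": "steak", "ending": "liqueur"},
]
sparse_fact = {"main": "steak", "ending": "coffee"}
odd_fact = {"main": "pasta"}
```

Here `sparse_fact` is compatible with the first stereotype only, so it inherits the side dish, whereas `odd_fact` contradicts every stereotype and is returned unchanged.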

2.6 Extraction of Stereotypes 

Implicit reasoning is formalized here with both the default subsumption and
the sets of stereotypes which structure our memory. Up to now, these sets of
stereotypes were supposed to be given. This section shows how our memories can
be organized into sets of stereotypes. In other words, the aim is to model the way
facts aggregate into structures which make implicit reasoning possible. From a
technical point of view, this memory organization process can be seen as a
non-supervised learning task that we call default clustering, which can be
summarized as follows.

Given a set of facts $F$ described with conceptual graphs, a non-supervised
learning algorithm is supposed to organize the initial set of facts $F$ into a
structure, for instance a hierarchy, a lattice or a pyramid. In the present case, we
restrict ourselves to partitions of the training set, which correspond to sets of
stereotypes. Let us recall that $(F_1, F_2, \ldots, F_n)$ constitutes a partition of the
set $F$ if and only if:

1. $\forall i \in [1, n],\ F_i \subseteq F$

2. $\bigcup_{i \in [1, n]} F_i = F$

3. $\forall (i, j) \in [1, n]^2,\ i \neq j \Rightarrow F_i \cap F_j = \emptyset$

This partition may be generated by $n$ conceptual graphs $\{g_1, g_2, \ldots, g_n\}$: it is
sufficient to associate to each $g_i$ the set $F_i$ of facts belonging to $F$ and covered
by $g_i$ relative to $(g_1, g_2, \ldots, g_n)$.

To choose among the numerous possible structures, even with simple
structures like partitions, a non-supervised algorithm requires a distance. The usual
way is to minimize the so-called intra-class distance, i.e. the average distance
between examples belonging to the same class, and/or to maximize the
inter-class distance, i.e. the sum of distances between pairs of examples belonging to
different classes. The key point is to have a distance among examples of the
learning set and to extend it to intra- and inter-class distances.

The first distance considered used probabilities. It is very similar to the 
Category Utility measure [12] which is used in the COBWEB system [13] to 
evaluate good partitions. But in practice this measure is not really appropriate 
for sparse descriptions. Moreover, runtime cost was rather high, which made the 
learning algorithm very inefficient. 

Referring to the definition of both “sets of stereotypes” and the relative cover,
it appears natural to make use of a distance close to the measure of dissimilarity
$M_D$. This is exactly what we do by introducing a cost function $h$ based on $M_D$:

Definition 6. $F$ being a set of facts, the so-called training set, $S = (s_1, s_2, \ldots, s_n)$
a set of stereotypes and $C_S$ the function that associates to each fact $f$ its relative
cover, i.e. its closest stereotype with respect to $M_D$ and $S$, the cost function $h$ is
defined as follows:

$h(S) = \sum_{f \in F} M_D(\delta(f), C_S(f))$

Once the cost function $h$ has been defined, the non-supervised learning
algorithm has to build the set of stereotypes $(s_1, s_2, \ldots, s_n)$ that minimizes $h$. In
other words, the non-supervised learning problem is reduced to an optimization
problem.
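
As an illustration, the cost function can be sketched in Python on flat attribute/value dictionaries (an assumption; the model itself is stated on conceptual graphs). Two further assumptions are ours: a fact that no stereotype covers is charged an infinite cost, and ties between equally close stereotypes are ignored since they do not change the sum. The names `dissimilarity` and `cost` are hypothetical.

```python
import math

def dissimilarity(u1: dict, u2: dict) -> float:
    """Infinite on contradiction, else the count of unshared descriptors."""
    shared = u1.keys() & u2.keys()
    if any(u1[a] != u2[a] for a in shared):
        return math.inf
    return len(u1) + len(u2) - 2 * len(shared)

def cost(facts: list, stereotypes: list) -> float:
    """h(S): sum over the facts of the distance to the closest stereotype."""
    total = 0.0
    for fact in facts:
        # Assumption: an uncovered fact contributes an infinite cost.
        total += min((dissimilarity(fact, s) for s in stereotypes),
                     default=math.inf)
    return total

facts = [{"a": "1"}, {"a": "1", "b": "2"}, {"a": "3"}]
two_stereotypes = [{"a": "1", "b": "2"}, {"a": "3", "c": "4"}]
one_stereotype = [{"a": "1"}]

print(cost(facts, two_stereotypes))  # 1 + 0 + 1
print(cost(facts, one_stereotype))   # the third fact is not covered
```

Comparing the two calls shows why the choice of the set of stereotypes matters: the second set leaves one fact uncovered, which this sketch penalizes with an infinite cost.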

There are several methods for exploring such a search space. One is incremental
and very similar to the one used by Fisher in COBWEB. It starts from an empty 
set with no stereotypes, considers at each step a new individual to be covered 
and updates the set of stereotypes with some specific operators. For instance, 
one of them creates a new stereotype equal to the considered individual; another 
modifies one existing stereotype to cover it; the “merge” operator merges two 
stereotypes belonging to the set of stereotypes and the “split” operator splits 
one active stereotype. The search in COBWEB is a “hill-climbing” strategy; its 
robustness is largely due to these last two specific operators, the “merge” and 
the “split”. However, in case of sparse descriptions and especially with graphs, 
the “merge” and “split” operators cannot be easily implemented. Therefore, it 
is difficult to apply this algorithm here. 

The second option is to search for the best set of stereotypes using general 
optimization techniques. We chose a “tabu” strategy, which is a classical meta- 
heuristic technique used in operational research. It seems quite well adapted to 
solve our problem, as we shall see in the next section. From a technical point of 
view, a neighborhood is calculated from the current solution with the assistance 
of permitted movements. These movements can be of low influence (enrich one 
stereotype with a descriptor, remove a descriptor from another) or of high in- 
fluence (add or retract one stereotype from the current set of stereotypes). As 
in almost all local search techniques, there is a trade-off between exploitation, 
i.e. choosing the best movement, and exploration, i.e. choosing a non optimal 
state. The search uses short and long-term memory to avoid loops and to in- 
telligently explore the search space. We shall not detail here the “tabu” search 
algorithm since it is a classical one (see [11]); we shall just evaluate its robustness 
on artificial data in the next section. 
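
For readers unfamiliar with this meta-heuristic, the following Python sketch shows the general shape of a tabu search over sets of stereotypes. It is not the algorithm used in our program: it works on flat attribute/value descriptions, keeps the number of stereotypes fixed, uses only the low-influence moves (adding or removing one descriptor), and has a short-term memory only. All names are hypothetical.

```python
import math
from collections import deque

def dissimilarity(u1, u2):
    """Infinite on contradiction, else the count of unshared descriptors."""
    shared = u1.keys() & u2.keys()
    if any(u1[a] != u2[a] for a in shared):
        return math.inf
    return len(u1) + len(u2) - 2 * len(shared)

def cost(facts, state):
    """Sum over the facts of the distance to the closest stereotype."""
    return sum(min(dissimilarity(f, dict(s)) for s in state) for f in facts)

def neighbours(state, descriptors):
    """Low-influence moves: add or remove one descriptor in one stereotype."""
    for i, stereotype in enumerate(state):
        attributes = {a for a, _ in stereotype}
        for d in descriptors:
            if d in stereotype:
                changed = stereotype - {d}
            elif d[0] not in attributes:
                changed = stereotype | {d}
            else:
                continue  # the attribute already has another value
            yield state[:i] + (changed,) + state[i + 1:]

def tabu_search(facts, n_stereotypes, iterations=20, tabu_size=10):
    descriptors = sorted({item for f in facts for item in f.items()})
    # A state is a tuple of frozensets of (attribute, value) pairs.
    current = tuple(frozenset() for _ in range(n_stereotypes))
    best, best_cost = current, cost(facts, current)
    tabu = deque([current], maxlen=tabu_size)  # short-term memory
    for _ in range(iterations):
        candidates = [s for s in neighbours(current, descriptors)
                      if s not in tabu]
        if not candidates:
            break
        # Move to the best admissible neighbour, even if it is worse.
        current = min(candidates, key=lambda s: cost(facts, s))
        tabu.append(current)
        current_cost = cost(facts, current)
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return [dict(s) for s in best], best_cost

facts = [{"a": "1", "b": "1"}, {"a": "1"}, {"a": "2", "b": "2"}, {"b": "2"}]
found, found_cost = tabu_search(facts, n_stereotypes=2)
```

The tabu list is what distinguishes this loop from plain hill-climbing: the search always moves to the best neighbour that was not visited recently, so it can leave a local minimum instead of stopping there.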

3 Experiments 

This section validates our approach in the Attribute/Value formalism, before
proposing a real application dealing with a famous French affair translated into
conceptual graphs.

3.1 Validation on Artificial Data Sets 

This evaluation uses artificial data sets to validate the robustness of the
non-supervised learning algorithm, which builds sets of stereotypes from a learning
set $F$ and a description language $D$. Let us recall that stereotypes are supposed
to be more or less shared by many people living in the same society. Since the use
of stereotypes is the way to model implicit reasoning, it could explain why
prejudices and presuppositions are almost identical within a group. Our second
hypothesis is that people reason from sparse descriptions that they are always able
to complete and to organize into a set of stereotypes in their memory. These two
hypotheses entail that people who have had different experiences, and who have
read different news, are able to build very similar sets of stereotypes from very
different learning sets. Therefore, our attempt to model the construction of sets of
stereotypes with a non-supervised learning algorithm ought to have this stability
property. We evaluate it here on artificial data.

Let us now consider the Attribute/Value formalism. Given this description
language, we introduce some full and consistent descriptions, e.g. $(d_1, d_2, d_3)$,
which stand for the description of a set of stereotypes. Let us denote by
$n_s$ the number of such descriptions. These $n_s$ descriptions may be randomly
generated; the only requirement is that they be full and consistent.

The second step of the artificial set generation is to duplicate these
descriptions $n_d$ times, for instance 50 times, making $n_s \times n_d$ artificial examples.
Then, these $n_s \times n_d$ descriptions are arbitrarily degraded: descriptors belonging
to these duplicates are chosen randomly to be destroyed. The only parameter
is the percentage of degradation, i.e. the ratio of the number of destroyed
descriptors to the total number of descriptors. The generated learning set contains
$n_s \times n_d$ example descriptions, which all correspond to degradations of the $n_s$
initial descriptions.
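
The generation procedure can be sketched as follows. This is an illustrative Python reading of the two steps above, not the code used in the experiments; the function name `generate_learning_set`, the seed parameter and the sample descriptions are assumptions.

```python
import random

def generate_learning_set(descriptions, n_d, degradation, seed=0):
    """Duplicate each full description n_d times, then destroy a given
    ratio of all descriptors, chosen uniformly at random."""
    rng = random.Random(seed)
    examples = [dict(d) for d in descriptions for _ in range(n_d)]
    # Every (example index, attribute) pair is a descriptor that may be destroyed.
    slots = [(i, a) for i, example in enumerate(examples) for a in example]
    n_destroyed = round(degradation * len(slots))
    for i, attribute in rng.sample(slots, n_destroyed):
        del examples[i][attribute]
    return examples

# Three full and consistent descriptions over four attributes (n_s = 3).
descriptions = [
    {"colour": "red", "size": "small", "shape": "round", "taste": "sweet"},
    {"colour": "green", "size": "large", "shape": "long", "taste": "bitter"},
    {"colour": "blue", "size": "medium", "shape": "flat", "taste": "salty"},
]
learning_set = generate_learning_set(descriptions, n_d=50, degradation=0.25)
```

With these parameters the learning set holds 150 examples and 450 of the original 600 descriptors, each example being a sub-description of the initial description it was copied from.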

The default clustering algorithm is tested on these artificially generated and
degraded learning sets. Then, the stability property is evaluated by comparing the
set of stereotypes built by the non-supervised algorithm with the $n_s$
descriptions initially given when generating the artificial learning set.

Our first evaluation consists in comparing the quality (i.e. the percentage of
descriptors) and the number of generated stereotypes to the initial descriptions,
$n_s$, while the percentage of degradation increases from 0% up to 100%. It appears
that up to more than 85% of degradation, the sets of stereotypes correspond
most of the time to the initial ones (see figure 2).





Fig. 2. Quality and number of stereotypes discovered. 



The second test counts the classification error rate, i.e. the rate of degraded
facts that are not covered by the right stereotype. We mean by “right stereotype”
the discovered description that corresponds to the initial fact the degraded facts
come from. Fig. 3 shows the results of our program P.R.E.S.S. relative to
three classic classification algorithms: k-means, COBWEB and EM. These
experiments show that P.R.E.S.S. performs well, with a very stable learning
process: up to 75% of degradation, the error rate is less than 10% and compares
favorably with that of the three others.

Fig. 3. Classification error of degraded examples.

3.2 Studying Social Misrepresentation

The application we propose deals with a historic event, the famous miscarriage
of justice known as the Dreyfus Affair, which occurred at the end of the 19th
century in France. In 1894 Captain Alfred Dreyfus, an officer on the French
general staff, was accused of spying for Germany, France’s opponent in the previous
war. There were many articles about this very complex affair, offering different
views depending on the date, recent events, and the newspaper’s political leanings.
Thus, the liberal pro-Dreyfus Le Siècle expressed opinions which were
diametrically opposed to those of the conservative anti-Dreyfus L’Éclair. The facts we
considered have been taken from these articles and translated into conceptual
graphs, in order to automatically build a simplified model of the affair. The
objective is to understand the influence of the press on mental representations
during this period.

Type hierarchies including 399 concepts and 174 relations were built for this
specific context. In addition, a typical graph was proposed in order to translate
the articles into facts. Fig. 4 shows an example of a graph. It represents an article
from the newspaper L’Éclair using the CoGITaNT library implemented by D.
Genest and E. Salvat [14]. This C++ library for manipulating conceptual graphs
was chosen because of its GNU public licence, its great flexibility and the quality
of the available documentation.

It can be summarized as follows: the article taken from the newspaper
L’Éclair explicitly asserts that Alfred Dreyfus is guilty because Esterhazy was
proved innocent by the courts. Once several articles have been translated in this
way, stereotypes can be discovered using the methods proposed earlier.

Fig. 4. A conceptual graph which translates a newspaper article.

4 Conclusion 

Flows of information play a key role in today’s society. However, the value of 
information depends on its being interpreted correctly, and implicit knowledge 
has a considerable influence on this interpretation. This is the case in many of
today’s heterogeneous databases, which are far from complete and consequently
need special techniques to be completed automatically. This is particularly true
of the media, such as newspapers, radio and television, where the information 
given is always sparse. 

In this context we propose a cognitive model based on sets of stereotypes 
which summarize facts by “guessing” the missing values. Stereotypes are an alter- 
native to prototypes and are more suitable in the categorization of sparse descrip- 
tions. They rely on the notion of default subsumption which relaxes constraints 
and makes possible the manipulation of such descriptions. Descriptions are then 
completed according to the closest stereotypes, with respect to the dissimilarity 
measure $M_D$. Very good results have been found in the Attribute/Value
formalism with artificial data sets. Our interest is now focused on a real application
using conceptual graphs. 

This work relates to the domain of social representations as introduced by 
Serge Moscovici in [15]. According to him, social representations are a sort of 
“common sense” knowledge which aims at inducing behaviors and allows com- 
munication between individuals. We think that social representations can be 
constructed with the help of sets of stereotypes. The way these representations 
change can be studied through the media over different periods and social groups 
in comparison with such sets. This represents an unexplored way for enriching 
historical and social analysis. 

Finally, this paper emphasizes the danger of a conceptualization that is not
really adapted to a particular problem. Missing data might induce bad
interpretations and lead to erroneous results. This is also what we show with the
concept of stereotypes and the notion of default subsumption.

Abstract. Recent developments in reification of ER schemata include
automatic generation of web-based database administration systems [1,
2]. These systems enforce the schema cardinality constraints, but, beyond
unsatisfiable schemata, this feature may create unreachable instances.
We prove sound and complete characterisations of schemata whose
instances satisfy suitable reachability properties; these theorems translate
into linear algorithms that can be used to prevent the administrator from
reifying schemata with unreachable instances.



1 Introduction 

Web-based database administration systems are becoming extremely popular
as a means to access database information without the need to deploy a
specific, architecture-bound client. In particular, recent research focuses on systems
that are automatically generated from a specification in the language of entity- 
relationship (ER) schemata 1 [1,2]. The administrator specifies a schema in a 
suitable language, and the system generates an SQL implementation and an 
application running server side that presents the user with suitable forms for 
editing the database. 

An important feature of ER schemata is the possibility of specifying
cardinality constraints, that is, constraints on the number of relationships involving
a certain entity. They are particularly important in the case of content
management (i.e., database administration for content to be published on the web),
because relying on the constraints leads to simpler HTML generation (e.g.,
special cases or missing relationships do not need to be checked). For this reason,
ER-based database administration systems enforce cardinality constraints: when
modifying the database, the user is prevented from moving the database from a
legal configuration to an illegal one.

Cardinality constraints, however, cannot be imposed lightly, as they can
easily lead to problems when the schema is poorly designed. The first systematic
study of these issues can be found in a seminal paper by Lenzerini and Nobili [5],
which gave necessary and sufficient conditions for a schema to have an instance
at all (a condition called satisfiability), or to have non-empty instances (strong
satisfiability). This is the first, necessary step to ensure that the schema
definition makes any sense. Later, the results were further extended to schemata with
ISA [6], although the extension did not take into consideration type features, such
as disjunctive subtyping.

1 ER schemata are a popular conceptual model originally introduced by Chen [3], and
later extended in several ways [4].

P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 96-109, 2004.

Even if a schema has non-empty instances, however, there is another or- 
thogonal problem: modifications to the schema instance (i.e., to a database) are 
usually constrained (e.g., a user can modify just a subset of the instance in a 
transaction), and thus the existing instances may be “too far” to make the user 
actually able to modify one into another. 

A typical case is one in which we have entity types E and F and mandatory 
relationship types R from E to F and S from F to E. In this case, there is no 
way to move from the empty instance (without any entity or relationship) to any 
nonempty instance using only local modifications, that is, inserting, deleting or 
updating at most one entity and its relationships at a time (as indeed happens 
in web-based content management). In other words, the schema happily passes
all static satisfiability conditions, but it is in practice unusable.
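
The situation can be checked mechanically on a toy encoding. The Python sketch below is only an illustration under our own assumptions: an instance is a tuple of four sets (entities of E, entities of F, pairs of R, pairs of S), "mandatory" is read as "every source entity takes part in at least one relationship of that type", and a local insertion adds one fresh entity together with any relationships that involve it. All names are hypothetical.

```python
from itertools import chain, combinations

def is_legal(E, F, R, S):
    """R goes from E to F, S from F to E; both are mandatory for their source."""
    dangling = (any(e not in E or f not in F for e, f in R)
                or any(f not in F or e not in E for f, e in S))
    every_e_has_r = all(any(src == e for src, _ in R) for e in E)
    every_f_has_s = all(any(src == f for src, _ in S) for f in F)
    return not dangling and every_e_has_r and every_f_has_s

def powerset(items):
    items = list(items)
    return chain.from_iterable(combinations(items, k)
                               for k in range(len(items) + 1))

def insert_one_entity(E, F, R, S):
    """Yield every instance obtained by inserting one fresh entity
    together with relationships that involve it."""
    e, f = "e_new", "f_new"
    for targets in powerset(F):        # R-arcs leaving the new E-entity
        for sources in powerset(F):    # S-arcs reaching the new E-entity
            yield (E | {e}, F, R | {(e, y) for y in targets},
                   S | {(y, e) for y in sources})
    for targets in powerset(E):        # S-arcs leaving the new F-entity
        for sources in powerset(E):    # R-arcs reaching the new F-entity
            yield (E, F | {f}, R | {(x, f) for x in sources},
                   S | {(f, x) for x in targets})

empty = (frozenset(), frozenset(), frozenset(), frozenset())
populated = (frozenset({"e"}), frozenset({"f"}),
             frozenset({("e", "f")}), frozenset({("f", "e")}))
from_empty = list(insert_one_entity(*empty))
```

Starting from the empty instance, the two possible insertions both yield illegal instances, although `populated` shows that legal nonempty instances exist; starting from `populated`, by contrast, some insertions are legal.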

The purpose of this paper is to present sound and complete algorithms that 
statically check an ER schema for instance reachability conditions. In particular, 
we present an algorithm guaranteeing that all instances are mutually reachable, 
and an algorithm guaranteeing that all everywhere nonempty instances are
mutually reachable. The second case is useful when a pre-populated database needs
very stringent constraints that do not pass the first check (as happens in the
previous example). These results complement the aforementioned static
satisfiability results: all instances of a schema with no instance at all, for instance,
are by definition mutually reachable, so satisfiability tests are necessary, but, as
shown in our previous example, instances of fully satisfiable schemata may be
mutually unreachable.

To make our reasoning precise, we need a completely formal definition and 
semantics of ER schemata that refers only to sets, elements and functions (as 
proposed, for instance, in [7]); as a by-product, our results are valid indepen- 
dently of the kind of DBMS that is used to implement the schema semantics 2 . 

The main motivation for this work is ERW (http://erw.dsi.unimi.it/), 
a reification tool developed by the author that creates a complete web-based 
database administration system starting from a schema definition. ERW sup- 
ports sophisticated features such as weak entities, multiple inheritance, user 
authentication and authorisation, etc. It has been in use for the last two years 
to manage the web site of the Computer Science Department of the Università
degli Studi di Milano; recently, other departments and universities in Italy have
refactored their databases to use ERW. Since its public release in February 2002 
there have been more than 10 000 accesses to the ERW home page. 



2 In general, ER-based management systems should be based on a formal, abstract
semantics of schema instances and transactions so that different database models
(i.e., relational or object-oriented) and different DBMS can be used to implement
the abstract semantics.

ERW is based on the abstract semantics described in this paper. In
particular, before creating the actual database administration system it performs a
number of checks on the validity of the schema. One of the algorithms solves
the problem of double ownership, that is, the problem of identifying weak
entities, and has been described in [2]. In this paper, we describe the techniques by
which we prevent administrators from creating databases that can lock the user
in unmodifiable configurations.

No published solution to these problems is known to the author. To be fair,
only a small part of the work on conceptual modeling is formalised
mathematically, and this makes it difficult, if not impossible, to develop algorithms and
prove their correctness. Moreover, the problem of reachability was born with 
stateless web interactions: the usual notion of database transaction is stateful, 
so reachability is not a problem in that case. 

Note that general techniques for satisfiability problems in first-order theories
(possibly more general than entity-relationship schemata) are useless
here, because the problem we study is inherently of higher order: we want
to prove statements about the models of the theory.

First, we briefly introduce the semantics for schemata with binary
relationship types that we proposed in [2]; it extends the original one given by Chen [3],
but adds typing and inheritance (the original ER proposal did not include
explicit subtyping information). Moreover, it introduces multirelations as the
abstract semantics of relationship types (with respect to the formulation given
in [2], we add a new notion, that of abstract entity types, motivated partly by
ontological reasons and partly by the need to represent complete disjunctive
types).

Once the semantics is set up, we introduce a notion of isomorphism of
instances that allows us to define precisely when two schema instances are
equivalent modulo the particular elements used in representing them, and of local
modification, which models stateless interaction with a web client.

At this point, we discuss two notions of reachability and characterise them
mathematically in a sound and complete way. Linear check algorithms follow
immediately from the characterisation. However, the characterisation is given
for schemata without ISA relationship types (i.e., without subtyping). Thus, we
conclude by discussing some techniques that should be applied when subtyping is
present. We also discuss extensions to n-ary relationship types.

A Java implementation of the algorithms described in this paper and in [2]
is available as free software from the author at the URI above.

2 Schemata 

We note first that we do not need to introduce attributes in our schema definition. 
Since we have to discuss just cardinality issues, it is sufficient to be able to 
speak of sets and elements (one can of course add an attribute map for every set 
involved and easily discuss keys, attribute constraints and so on). Moreover, for 
simplicity we present the algorithm for binary relationship types, and discuss at 
the end of the paper the (easy) extension for types of arbitrary arity. 

An EER schema (of binary relations) $\mathcal{S}$ is given by a set $\mathcal{E}$ of entity types,
a set $\mathcal{R}$ of relationship types, a source function $s : \mathcal{R} \to \mathcal{E}$ and a target function
$t : \mathcal{R} \to \mathcal{E}$ (note that in order to give a formal semantics to an EER schema
without roles you need to consider relationship types as directed). An entity type
may be abstract. Moreover, each relationship type has a source and a target
cardinality constraint, which is a symbol out of (0:1), (1:1), (0:N), (1:N),
(0:M), (1:M). The ordered pair of cardinality constraints of a relationship type
is usually written as (-:-)→(-:-). Finally, a relationship type may optionally be
marked ISA, in which case its constraint must be (1:1)→(0:1)³.

Whenever there is an ISA relationship type from E to F, E is said to be a 
direct subtype of F, and F a direct supertype of E. A subtype of E is either E or 
a direct subtype of a subtype of E (analogously for supertypes). 
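
The recursive definition of subtype amounts to a reflexive and transitive closure of the ISA arcs, which the following Python sketch computes. The representation of ISA arcs as (subtype, supertype) pairs, the function name `subtypes` and the sample entity types are assumptions made for the example.

```python
def subtypes(entity_type, isa_arcs):
    """Return all subtypes of entity_type: the type itself plus every
    direct subtype of one of its subtypes.
    isa_arcs is a set of (E, F) pairs, one per ISA relationship type from E to F."""
    result = {entity_type}
    frontier = [entity_type]
    while frontier:
        current = frontier.pop()
        for sub, sup in isa_arcs:
            if sup == current and sub not in result:
                result.add(sub)
                frontier.append(sub)
    return result

# Hypothetical hierarchy: PhDStudent ISA Student ISA Person, Professor ISA Person.
isa_arcs = {("Student", "Person"), ("Professor", "Person"),
            ("PhDStudent", "Student")}
```

Supertypes can be obtained in the same way by following the arcs in the opposite direction.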

Note that there are two new cardinality types: (0:M) and (1:M). The M 
indicates that we are actually requiring a multirelation. 

The notion of abstract entity type is a new extension with respect to the 
definition presented in [2], and it is borrowed from object-oriented languages 
(notably Java). The idea is that it should be used for types which are necessary 
for a correct structuring of the type hierarchy, but that are “universals” (in 
the ontological sense) and thus have no instance themselves — they can be just 
instantiated through their subtypes. We shall see that this extension allows one 
to implement some additional type constructs (at the price of a more complex 
interaction between types and constraints). 

3 Instances 

To define a schema instance, one introduces multirelations in the spirit of bicat- 
egory theory [8, 9]: 

Definition 1. A (binary) multirelation from set $X$ to set $Y$ is a set $M$ endowed
with two functions, the left leg $M_0$ and the right leg $M_1$:

$$X \xleftarrow{\;M_0\;} M \xrightarrow{\;M_1\;} Y$$

Two elements $x \in X$ and $y \in Y$ are related if there is an $r \in M$ such that
$M_0(r) = x$ and $M_1(r) = y$, a condition that we write in short with the notation
$r(x, y)$. The reader should notice that it can happen that there are two elements
$r, s \in M$ such that $r(x, y)$ and $s(x, y)$. In this case, the elements $x$ and $y$ are
related “more than once”. If this never happens, then $M$ is a standard relation
represented in tabular form (if you start from a relation as a subset of $X \times Y$
the two legs are just the projections).
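
As a minimal sketch of Definition 1 (an illustration only, not taken from the Java implementation mentioned above), a finite multirelation can be stored as a map sending each element r of M to the pair of its legs. The class name `Multirelation` and its methods are assumptions introduced here.

```python
from collections import Counter


class Multirelation:
    """A finite binary multirelation: a set M with a left and a right leg.

    `legs` maps each element r of M to the pair (M0(r), M1(r)).
    The names used here are illustrative assumptions, not the paper's API.
    """

    def __init__(self, legs):
        self.legs = dict(legs)

    def related(self, x, y):
        """Return the elements r of M such that r(x, y) holds."""
        return [r for r, pair in self.legs.items() if pair == (x, y)]

    def is_relation(self):
        """True when no pair (x, y) is related more than once."""
        counts = Counter(self.legs.values())
        return all(c == 1 for c in counts.values())


# r and s both relate x1 to y1, so this is a proper multirelation.
m = Multirelation({"r": ("x1", "y1"), "s": ("x1", "y1"), "u": ("x2", "y1")})
```

Here `m.related("x1", "y1")` contains both `r` and `s`, which is exactly the "related more than once" situation, so `m.is_relation()` is false.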

Definition 2. An instance $\sigma$ for a schema $\mathcal{S}$ is given by a map $\sigma$ assigning
to each entity type $E$ in $\mathcal{E}$ a set $\sigma(E)$ and to each relationship type $R$ in $\mathcal{R}$ a
multirelation $\sigma(R)$ satisfying the following properties:

1. The left leg of $\sigma(R)$ must end in $\sigma(s(R))$, and the right leg in $\sigma(t(R))$; in
other words, if $R$ is a relationship type from entity type $E$ to entity type $F$,
then $\sigma(R)$ must be a multirelation

$$\sigma(E) \longleftarrow \sigma(R) \longrightarrow \sigma(F)$$

that is, a multirelation with a left leg ending in $\sigma(E)$ and a right leg ending
in $\sigma(F)$.

2. A cardinality constraint of the form (1:-) requires that the corresponding
leg be a surjective function.

3. A cardinality constraint of the form (-:1) requires that the corresponding
leg be an injective function.

4. A cardinality constraint of the form (-:N) requires that the multirelation is
actually a relation (i.e., nothing is related twice).

5. Whenever a relationship type is marked ISA, the first leg is the identity and
the second leg is an inclusion map, so that the source entity set is a subset
of the target entity set.

6. Whenever entity types $E$ and $F$ have a common supertype and
$x \in \sigma(E) \cap \sigma(F)$, there is a common subtype $G$ of $E$ and $F$ such that $x \in \sigma(G)$.$^4$

7. If $E$ is abstract, $\sigma(E)$ is exactly the union of $\sigma(F)$ when $F$ ranges through
the proper subtypes of $E$.

3 For completeness, we should introduce the WEAK labelling for arcs that represent
identification functions, but this has no effect on the present discussion.

The wording of cardinality constraints may seem a bit unorthodox: however, 
it is easy to see that it is exactly equivalent to the standard participation inter- 
pretation of constraints, and moreover extends immediately to n-ary relationship 
types. 
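
Properties 2 to 4 of Definition 2 can be checked mechanically on a finite multirelation. The following is a hedged sketch: the function names, the representation of a multirelation as a dictionary from elements to tuples of legs, and the encoding of constraints as strings such as `"1:N"` are all assumptions made for illustration.

```python
def leg_is_surjective(legs, index, entity_set):
    """(1:-): every entity is hit by the chosen leg at least once."""
    return set(entity_set) <= {pair[index] for pair in legs.values()}


def leg_is_injective(legs, index):
    """(-:1): no entity is hit by the chosen leg more than once."""
    images = [pair[index] for pair in legs.values()]
    return len(images) == len(set(images))


def is_relation(legs):
    """(-:N): no tuple of entities is related more than once."""
    tuples = list(legs.values())
    return len(tuples) == len(set(tuples))


def satisfies(legs, index, entity_set, constraint):
    """Check one cardinality constraint such as "1:N" on leg `index`."""
    low, high = constraint.split(":")
    if low == "1" and not leg_is_surjective(legs, index, entity_set):
        return False
    if high == "1" and not leg_is_injective(legs, index):
        return False
    if high == "N" and not is_relation(legs):
        return False
    return True  # a bound of 0 or M imposes no requirement


# Hypothetical instance: two relationships, both ending in f1 on the right.
legs = {"r1": ("e1", "f1"), "r2": ("e2", "f1")}
E = {"e1", "e2"}
F = {"f1", "f2"}
```

Because each leg is handled by its index, the same functions apply unchanged to legs of multirelations with more than two components.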

Condition (6) is important as it forces definite typing. Essentially, every entity
must have a type: if, for instance, man and woman are subtypes of person, it is
not possible that $x$ belongs to $\sigma(\text{man})$ and $\sigma(\text{woman})$ (unless you add a common
subtype hermaphrodite of both man and woman).

An entity is now a pair $(E, x)$ such that $E$ is the type of $x$. Note that we did
not restrict entity sets to be disjoint, so an $x$ whose type is $E$ and an $x$ whose
type is $F$ are actually distinct entities.$^5$

Finally, condition (7) gives a precise semantics to abstract entity types: their 
instances are all subtype instances. In particular, an abstract entity type without 
subtypes cannot have instances. 

Abstract types are useful as they allow one to model complete disjunctive 
subtyping. Getting back to the example above, by making person abstract we 
force every person to be either a man or a woman. 

4 The last two conditions are an elementary phrasing of a stability condition on suit- 
ably defined maps — see [2]. 

5 This feature parallels the common usage of numerical identifiers to represent set
elements in SQL databases: the identifiers are not necessarily distinct in different
tables.

4 Local Modifications

To formalise reachability problems, we introduce local modifications, which par- 
allel the action of a user interacting with a database by means of a web browser 
(and thus, essentially, statelessly). 

Definition 3. A local modification of an instance $\sigma$ of the schema $\mathcal{S}$ is one of
the following operations:

1. adding an entity $e$ of type $E$, together with relationships involving $e$ and
another entity;

2. adding a relationship;

3. deleting an entity $e$ of type $E$ and all relationships involving $e$;

4. deleting a relationship.

If $\tau$ is derived from $\sigma$ by means of a local modification, we write $\sigma \to \tau$. We write
$\sigma \Rightarrow \tau$ if there is a chain

$$\sigma = \nu_0 \to \nu_1 \to \cdots \to \nu_n = \tau$$

of instances.

Note that the previous definition is targeted to the applications we have in
mind: stateless interaction with a server does not allow access to data stored
client-side until there is a commit (if we are interacting to create $e$ we cannot
access $e$ as if it were already on the server). This is the reason why the notions of
addition and of deletion of an entity are not symmetric.



5 Isomorphism of Instances 

To discuss reachability problems, it is important to identify instances which differ
only in the set-theoretical identity of their elements, but not in their relational
structure, as shown in the example of Fig. 1.




Fig. 1. Isomorphic instances.

Definition 4. We say that instances $\sigma$ and $\tau$ of a schema $\mathcal{S}$ are isomorphic
if we have for each entity type $E \in \mathcal{E}$ and for each non-ISA relationship type
$R \in \mathcal{R}$ bijections $\iota_E : \sigma(E) \to \tau(E)$ and $\iota_R : \sigma(R) \to \tau(R)$ such that

$$r(x, y) \in \sigma(R) \iff \iota_R(r)\bigl(\iota_{s(R)}(x), \iota_{t(R)}(y)\bigr) \in \tau(R).$$

The definition above could be rephrased by saying that the entity and
relationship sets must be in bijection, and the bijections must commute with the
multirelations (more precisely, with their legs).$^6$ In particular, we notice that
every instance in which all entity and relationship sets contain at most one element
is unique up to isomorphism.



6 Full Reachability 

We can now formalise our problem in its most general form: 

Problem 1. Given a class $\mathcal{C}$ of instances of the schema $\mathcal{S}$, is it true that for
every pair of instances $\sigma, \tau \in \mathcal{C}$ we have, up to isomorphism, $\sigma \Rightarrow \tau$?

This question is important: if, for instance, there is no path out of the empty 
instance, then it is impossible to populate the instance by means of local mod- 
ifications, that is, using a web-based database administration system. As we 
discussed in the introduction, if we have two entity types E and F, and two 
relationship types R from E to F and S from F to E , and both R and S have 
constraints $(1:N) \to (0:N)$, then no local modification is possible on the empty
instance (as adding an entity to E or F would violate a cardinality constraint). 
However, in this case, it is easy to see that all nonempty instances are mutually 
reachable. 

The situation is very different if we instead assume $(1:1) \to (1:N)$ as the
constraint of $R$ and $S$. In this case, all instances are mutually unreachable. This
happens because every instance contains a cycle of surjective functions, and a
trivial counting argument shows that all entity sets along such a cycle must have
the same cardinality (assuming, of course, the entity sets are finite). This forbids
any local modification. Clearly we want to avoid this case.
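
The counting argument can be spelled out as follows. This is only a sketch, under the reading of Definition 2 in which the source constraint governs the left leg and the target constraint the right leg. With $R$ from $E$ to $F$ constrained by $(1:1) \to (1:N)$, the left leg of $\sigma(R)$ is a bijection onto $\sigma(E)$ and the right leg is surjective onto $\sigma(F)$; the same holds for $S$ with the roles of $E$ and $F$ exchanged. For finite sets this gives:

```latex
\[
  |\sigma(E)| = |\sigma(R)| \;\ge\; |\sigma(F)| = |\sigma(S)| \;\ge\; |\sigma(E)|
  \quad\Longrightarrow\quad
  |\sigma(E)| = |\sigma(R)| = |\sigma(F)| = |\sigma(S)| .
\]
```

All four sets therefore have the same size in every instance.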

To give our results, we set up a few terms that will be useful: a schema is 
said to be flat if it contains neither ISA relationship types nor abstract entity 
types; an endorelationship of an entity e is a relationship of the form r(e, e). 

Definition 5. Given a flat schema $\mathcal{S}$, the graph $\Gamma(\mathcal{S})$ is defined as follows:
the set of nodes of $\Gamma(\mathcal{S})$ is the set of entity types $\mathcal{E}$, and there is an arc from $E$
to $F$ whenever there is a relationship type with constraint of the form
$(-:-) \to (1:-)$ from $E$ to $F$ or a relationship type with constraint of the form
$(1:-) \to (-:-)$ from $F$ to $E$.



6 The notion of instance can be easily seen to be a special case of the notion of
pseudo-functor between bicategories [8, 10]; the notion of isomorphism above is just
an elementary restatement of the definition of functor isomorphism.







The graph $\Gamma(\mathcal{S})$ embodies the mandatoriness constraints in a form that
makes it easy to check whether they are contradictory:

Theorem 1. Let $\mathcal{S}$ be a flat schema. Then all instances of $\mathcal{S}$ are mutually
reachable if and only if $\Gamma(\mathcal{S})$ is acyclic (i.e., it contains no cycles).

Proof. We start by proving the right-to-left implication. It is sufficient to show
that from an instance $\sigma$ we can reach the empty instance by means of reversible
local modifications. To do so, it is sufficient to show that we can always reversibly
delete an entity from any given instance.

The only local modification that is not reversible in general is deletion of an
entity $e$ that has endorelationships. However, since by hypothesis the type of
those endorelationships cannot be mandatory (or there would be a loop), we can
obtain the same result by first deleting all endorelationships, and then finally
deleting $e$ (the last step is reversible, as all relationships at this point involve
some other entity).

Consider now a nonempty instance $\sigma$, and an entity $e$ of type $E$ of $\sigma$. Then,
either $e$ can be deleted without breaking any cardinality constraint, or there must
be a relationship type, say from $E$ to $F \neq E$, with constraint $(-:-) \to (1:-)$
satisfied by $e$ (the dual case of a relationship type from $F$ to $E$ with constraint
$(1:-) \to (-:-)$ can be treated analogously). Thus, there must be an arc of
$\Gamma(\mathcal{S})$ from $E$ to $F$, and an entity $f$ of type $F$ which is in relation with $e$. We
can iterate this procedure on $f$, but since $\Gamma(\mathcal{S})$ contains no cycle, after a finite
number of steps we must get to an entity that can be deleted.

For the other implication, consider a cycle in $\Gamma(\mathcal{S})$, to which we can associate
a sequence $E_0, R_0, E_1, R_1, \ldots, R_n, E_0$ of mandatory relationship types.$^7$ If
$n > 0$, given an instance $\sigma$ such that $\sigma(E_i) = \emptyset$ for all $i$, there is no local
modification that can change this condition, as adding an entity to any of the
$\sigma(E_i)$'s would imply adding at the same time $n$ other entities, which is not
allowed by the definition of local modification. If $n = 0$, adding an entity would
imply adding at the same time an endorelationship or other entities, which again
is not allowed by the definition of local modification.
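
The check given by Theorem 1 can be sketched as follows. This is an illustrative Python sketch under assumed data structures, not the author's Java implementation: a relationship type is a hypothetical tuple (source, target, source constraint, target constraint), with constraints written as strings such as `"1:N"`. The graph is built as in Definition 5 and tested for cycles with Kahn's algorithm, one of several standard linear-time choices.

```python
def gamma_graph(entity_types, relationship_types):
    """Build Gamma(S) for a flat binary schema as a dict of successor sets."""
    arcs = {e: set() for e in entity_types}
    for source, target, source_c, target_c in relationship_types:
        if target_c.startswith("1:"):  # (-:-) -> (1:-): arc source -> target
            arcs[source].add(target)
        if source_c.startswith("1:"):  # (1:-) -> (-:-): arc target -> source
            arcs[target].add(source)
    return arcs


def is_acyclic(arcs):
    """Kahn's algorithm: True when the directed graph has no cycle."""
    indegree = {node: 0 for node in arcs}
    for successors in arcs.values():
        for node in successors:
            indegree[node] += 1
    ready = [node for node, degree in indegree.items() if degree == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for successor in arcs[node]:
            indegree[successor] -= 1
            if indegree[successor] == 0:
                ready.append(successor)
    # Nodes on a cycle (self-loops included) never reach in-degree zero.
    return removed == len(arcs)


# The example of Section 6: R from E to F and S from F to E, both (1:N)->(0:N).
schema = [("E", "F", "1:N", "0:N"), ("F", "E", "1:N", "0:N")]
fully_reachable = is_acyclic(gamma_graph(["E", "F"], schema))
```

For this schema the graph contains the cycle between E and F, so `fully_reachable` is false, in agreement with the observation that the empty instance admits no local modification.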

7 Almost Full Reachability 

There are situations in which the previous check is too strict (for instance, if 
we start from a pre-populated database, which is never meant to be empty). A 
reasonable relaxing of full reachability is requiring that all instances in which all 
entity sets are nonempty are mutually reachable. We give a formal definition of 
this fact: 

Definition 6. A schema instance $\sigma$ is everywhere nonempty if for every $E \in \mathcal{E}$
we have $\sigma(E) \neq \emptyset$, that is, if all its entity sets are nonempty.

7 To simplify the proof, we assume without loss of generality that all mandatory types
involved are “in the right direction”.

In this case, we have to look for injective mandatory relations. A relationship
that is injective and mandatory, that is, such that every entity is in relation with
some other entity and such that two entities are never related to the same entity,
imposes a cardinality bound on the size of the source and target entity set (the
first one must be smaller than or equal to the second one). If we get a cycle of such
constraints, we may be in trouble (the fact that cycles of injective relations can
be substituted with cycles of bijections without changing the allowable instances
was already noticed in [4]).

Again, we build a graph that embodies the combinatorial bounds imposed 
by cardinality constraints: 

Definition 7. Given a flat schema $\mathcal{S}$, we define the graph $\Delta(\mathcal{S})$ as follows: the
set of nodes of $\Delta(\mathcal{S})$ is the set of entity types $\mathcal{E}$, and there is an arc from $E$ to $F$
whenever there is a relationship type with constraint of the form $(-:1) \to (1:-)$
from $E$ to $F$ or a relationship type with constraint of the form $(1:-) \to (-:1)$
from $F$ to $E$.

Theorem 2. Let $\mathcal{S}$ be a flat schema. Then all everywhere nonempty instances
of $\mathcal{S}$ are mutually reachable if and only if $\Delta(\mathcal{S})$ contains no cycles.

Proof. Analogously to the proof of Theorem 1, for the right-to-left implication
we show that from an instance $\sigma$ we can reach a fixed instance (up to
isomorphism) having exactly one entity per entity type and one relationship per
relationship type. Such an instance certainly satisfies all cardinality constraints
(as all multirelations are bijections).

We notice again that all entity deletions performed in the proof will be
reversible. Indeed, if we have to delete $e \in \sigma(E)$, with $|\sigma(E)| > 1$, and $e$ is involved
in endorelationships, we can certainly either delete them (because their type is
not mandatory), or modify them so as to relate to some other element of the same
type as $e$ (because they are not injective); otherwise, the types of those
relationships would generate a loop in $\Delta(\mathcal{S})$.

Consider now an everywhere nonempty instance $\sigma$, an entity set $\sigma(E)$ with
more than one element and an entity $e$ of type $E$. If $e$ can be deleted without
breaking any cardinality constraint, we are done. If $e$ cannot be deleted, this must
happen because it is the only entity associated to one or more other entities $f_0$,
$f_1, \ldots, f_n$ of type $F_0, F_1, \ldots, F_n$ and there are mandatory relationship types
$R_0, R_1, \ldots, R_n$ from each of $F_0, F_1, \ldots, F_n$ to $E$, satisfied by association with
$e$.

However, if these relationship types are not injective (i.e., if their cardinality
constraints are not of the form $(-:-) \to (-:1)$), then we can first make a finite
number of modifications associating $f_0, f_1, \ldots, f_n$ to a different element of $E$,
and then delete $e$. Otherwise, we take an element $f_i$ of type $F_i$, where $R_i$ has a
constraint of the form $(1:-) \to (-:1)$, and iterate the above operations (note
that $\sigma(F_i)$ must contain more than one element, or we could delete $e$ without
breaking any constraint of $R_i$). In doing so, we have followed an arc of $\Delta(\mathcal{S})$,
so we can iterate a finite number of times. At the last step, we obtain an entity
that can be deleted.

For the other implication, consider a cycle in $\Delta(\mathcal{S})$, to which we can associate
a sequence $E_0, R_0, E_1, R_1, \ldots, R_n, E_0$ of mandatory injective relationship types.
If $n \geq 0$, given an instance $\sigma$ such that $|\sigma(E_i)| = k > 0$ for all $i$, there is no
local modification that can change this condition, as adding an entity $e$ to any of
the $\sigma(E_i)$'s would imply either adding a relationship from $e$ to itself (if $n = 0$)
or otherwise adding at the same time $n > 0$ other entities. Indeed, $e$ must be
associated in $R_0$ to some other entity, but we cannot use any previously
existing entity, or we would violate injectivity (since all $\sigma(E_i)$'s have the same
cardinality, the $\sigma(R_i)$'s are all bijections).
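
The test of Theorem 2 differs from that of Theorem 1 only in which relationship types contribute arcs. The sketch below is an illustration under the same assumed encoding (hypothetical tuples of source, target, source constraint, target constraint); it builds the graph of Definition 7 and looks for a cycle with a three-colour depth-first search.

```python
def delta_graph(entity_types, relationship_types):
    """Build Delta(S): arcs come from mandatory *and* injective types only."""
    arcs = {e: set() for e in entity_types}
    for source, target, source_c, target_c in relationship_types:
        s_low, s_high = source_c.split(":")
        t_low, t_high = target_c.split(":")
        if s_high == "1" and t_low == "1":  # (-:1) -> (1:-): source -> target
            arcs[source].add(target)
        if s_low == "1" and t_high == "1":  # (1:-) -> (-:1): target -> source
            arcs[target].add(source)
    return arcs


def has_cycle(arcs):
    """Depth-first search; reaching a grey (in-progress) node closes a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in arcs}

    def visit(node):
        colour[node] = GREY
        for successor in arcs[node]:
            if colour[successor] == GREY:
                return True
            if colour[successor] == WHITE and visit(successor):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in arcs)


# The two examples of Section 6, with R from E to F and S from F to E.
rigid = [("E", "F", "1:1", "1:N"), ("F", "E", "1:1", "1:N")]
loose = [("E", "F", "1:N", "0:N"), ("F", "E", "1:N", "0:N")]
```

With these inputs the `rigid` schema yields a cycle, while the `loose` schema yields a graph without arcs, matching the earlier remark that all its nonempty instances are mutually reachable.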



8 Arbitrary Arity 

In general, one would like to handle relationship types of arbitrary arity. To 
this purpose we slightly (but naturally) extend the definition of schema and 
instances. 

Definition 8. An (n-ary) multirelation between the sets $X_0, X_1, \ldots, X_{n-1}$ is a
set $M$ endowed with $n$ functions, the legs, $M_0, M_1, \ldots, M_{n-1}$, with $M_i : M \to X_i$.




We can now easily extend the definition of schema by allowing n-ary rela- 
tionship types in the obvious way, and by giving their semantics using n-ary 
multirelations. 

Definition 9. An n-ary schema is given by a set of entity types $\mathcal{E}$ and a set of
relationship types $\mathcal{R}$. Each type $R \in \mathcal{R}$ is endowed with a list $E_0, E_1, \ldots, E_{n-1}$
of $n \geq 2$ component types and a list of $n$ cardinality constraints. In this case,
we say that $R$ is n-ary, or of arity $n$.

Extending correspondingly the definition of instances is now trivial. Note that
the definition of cardinality constraints remains valid also in this case.

To extend our results to n-ary schemata, all we need to do is to rephrase
correctly the construction of the graphs $\Gamma(\mathcal{S})$ and $\Delta(\mathcal{S})$. At that point it is not
difficult to show that Theorem 1 and Theorem 2 remain true.

Theorem 3. Given an n-ary schema $\mathcal{S}$, we define the graph $\Gamma(\mathcal{S})$ as follows:
the set of nodes of $\Gamma(\mathcal{S})$ is the set of entity types $\mathcal{E}$, and there is an arc from $F$
to $G$ whenever there is a relationship type $R$ with component types
$E_0, E_1, \ldots, E_{n-1}$ and indices $i \neq j$ such that $E_i = F$, $E_j = G$, and the $j$-th
constraint is of the form (1:-). Then, all instances of $\mathcal{S}$ are mutually reachable
if and only if $\Gamma(\mathcal{S})$ is acyclic.

The construction above puts an arc in $\Gamma(\mathcal{S})$ from $F$ to $G$ whenever $F$ and $G$
are connected by a relationship type in such a way that to each entity of type $G$
we must associate an entity of type $F$. In particular, this means that you cannot
have a ternary relationship type that is mandatory on two component types (as
you would create a cycle of length two).

Note also the careful wording of the definition: an apparently similar definition
claiming that we should insert an arc whenever there is a relationship type
$R$ whose component types include $F$ and $G$, where the constraint of $G$ is of the form
(1:-), would erroneously include the case that $F$ is equal to $G$ and in the same
position in the component list. This of course is wrong, as we would add an
arc from $E$ to $E$ for a relationship type with list of component types $E$, $F$ which
is mandatory in $E$ (the only correct one would be from $F$ to $E$). The position
of a type in the component list acts as a role; roles are usually specified using
identifiers, but from a mathematical viewpoint it is much more manageable to
use the position in the component list (this is also the approach taken in [4]).

Analogously, we can give a suitable redefinition of $\Delta(\mathcal{S})$:

Theorem 4. Given an n-ary schema $\mathcal{S}$, we define the graph $\Delta(\mathcal{S})$ as follows:
the set of nodes of $\Delta(\mathcal{S})$ is the set of entity types $\mathcal{E}$, and there is an arc from $F$
to $G$ whenever there is a relationship type $R$ with component types
$E_0, E_1, \ldots, E_{n-1}$ and indices $i \neq j$ such that $E_i = F$, $E_j = G$, the $i$-th constraint
is of the form (-:1), and the $j$-th constraint is of the form (1:-). Then, all
everywhere nonempty instances of $\mathcal{S}$ are mutually reachable if and only if
$\Delta(\mathcal{S})$ is acyclic.

In this case, we put an arc from $F$ to $G$ whenever $F$ and $G$ are connected by a
relationship type in such a way that to each entity of type $G$ we must associate
a distinct entity of type $F$.
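
The two n-ary constructions can be sketched together. In this illustrative Python sketch (the function name and the encoding are assumptions, not part of the paper), a relationship type is a pair of equal-length lists: its component types and its constraints, written as strings such as `"1:N"`. Positions, not type names, are compared when excluding the case $i = j$, which is the point of the careful wording discussed above.

```python
def nary_graphs(entity_types, relationship_types):
    """Return (gamma, delta) for a schema with relationship types of any arity.

    Each relationship type is a pair (components, constraints); the position
    of a type in `components` plays the part of a role.
    """
    gamma = {e: set() for e in entity_types}
    delta = {e: set() for e in entity_types}
    for components, constraints in relationship_types:
        for i, f in enumerate(components):
            for j, g in enumerate(components):
                if i == j:  # same position: never an arc
                    continue
                low_j = constraints[j].split(":")[0]
                high_i = constraints[i].split(":")[1]
                if low_j == "1":  # j-th constraint is (1:-)
                    gamma[f].add(g)
                    if high_i == "1":  # and i-th constraint is (-:1)
                        delta[f].add(g)
    return gamma, delta


# A hypothetical ternary type that is mandatory on two of its components.
ternary = [(["A", "B", "C"], ["1:N", "1:N", "0:N"])]
gamma, delta = nary_graphs(["A", "B", "C"], ternary)
```

In the resulting `gamma`, A and B point at each other, which is the cycle of length two mentioned after Theorem 3; any standard cycle detection can then be run on either graph.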



9 Considerations on Subtypes 

In the previous sections valid and complete checks for full and almost full reach- 
ability were provided. Turning these checks into algorithms is of course trivial, 
as cycle detection is linear. However, the theorems were given for schemata
without subtyping. When subtyping is taken into consideration, the combinatorics
of the problem becomes much more entangled, as we will try to explain using 
a few examples, and obtaining sound and complete results becomes much more 
difficult. 

First of all, there is no canonical everywhere nonempty instance. The heart
of the proof of Theorem 2 is that we have a unique (up to isomorphism) simple
instance to work with. If subtypes are present, this is no longer true.

Consider the schema shown in Fig. 2. If there are $k$ entities in $D$, there must
be at least $k$ entities in both $A$ and $B$. But this means that there must be at
least $2k$ entities in $C$ (because of definite typing) and thus at least $2k$ elements
in $D$ (because of $T$). This set of constraints reduces to $k \geq 2k$, that is, $k \leq 0$,
and thus it is satisfied by the empty instance only.

Fig. 2. A diagram without nonempty instances.



If we eliminate T, things get better, but it is easy to see that there is no 
everywhere nonempty canonical instance, as if we put one entity in B and C we 
are forced to put two in $A$; more complex type hierarchies may create extremely
entangled combinatorial constraints. 

One could think that at least the left-to-right implications of Theorems 1
and 2 should continue to hold (of course, all ISA relationship types will end up
in both $\Gamma(\mathcal{S})$ and $\Delta(\mathcal{S})$). However, there are certainly situations in which this
is not true. Consider the diagram in Fig. 3, where dashed entities are abstract.
It would pass all of our checks. Nonetheless, an instance in which every type has
exactly one element (i.e., $\sigma(A) = \sigma(B) = \sigma(C) = \{x\}$, $\sigma(E) = \sigma(F) = \sigma(G) = \{y\}$,
$\sigma(T) = \{t(y, x)\}$ and $\sigma(S) = \{s(x, y)\}$) has no legal modification.

The point is that relationship types are inherited. Since $C$ is a subtype of $A$,
elements of type $C$ may also participate in a relationship of type $T$. Moreover,
since $A$ is abstract, each entity of type $A$ must also be of type $C$. All in all, we
get a cycle analogous to the ones used in the proof of Theorem 2.

Abstractness here plays a fundamental role: were $A$ not abstract, we
could add an entity $z$ to $A$, change the relationship $t(y, x)$ to $t(y, z)$, delete $x$
and its adjacent relationship $s(x, y)$, delete $y$ and its adjacent relationship $t(y, z)$,
and finally $z$, getting to the empty instance.

Fig. 3. An apparently innocuous diagram fragment.

Indeed, cycles of this kind can be built only using forced subtypes ($E$ is a
forced subtype of $F$ if every entity of type $F$ is also of type $E$). From a
graph-theoretical viewpoint, an entity type $E$ is a forced subtype of $F$ if and only
if $E$ is the only minimal (i.e., without proper subtypes) subtype of $F$ and all
sequences of ISA relationship types going from $E$ to $F$ traverse abstract entity
types only (including $F$, but except possibly for $E$).
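
The graph-theoretical characterisation of forced subtypes lends itself to a direct check. The following is a sketch under stated assumptions: `isa` is a hypothetical dictionary mapping each type to the set of its direct supertypes, `abstract` is the set of abstract types, and the example hierarchy is invented for illustration (it is not claimed to be the one of Fig. 3).

```python
def subtypes(isa, f):
    """All subtypes of f (f included), following ISA arcs backwards."""
    found, stack = {f}, [f]
    while stack:
        current = stack.pop()
        for e, supers in isa.items():
            if current in supers and e not in found:
                found.add(e)
                stack.append(e)
    return found


def supertypes(isa, e):
    """All supertypes of e (e included), following ISA arcs forwards."""
    found, stack = {e}, [e]
    while stack:
        current = stack.pop()
        for s in isa.get(current, ()):
            if s not in found:
                found.add(s)
                stack.append(s)
    return found


def is_forced_subtype(isa, abstract, e, f):
    """Check the two conditions of the characterisation given in the text."""
    below_f = subtypes(isa, f)
    if e not in below_f:
        return False
    # Minimal subtypes of f are those without proper subtypes.
    minimal = {t for t in below_f if subtypes(isa, t) == {t}}
    if minimal != {e}:
        return False
    # Types lying on some ISA path from e to f, with e itself excluded.
    on_paths = (supertypes(isa, e) & below_f) - {e}
    return all(t in abstract for t in on_paths)


# Hypothetical chain C ISA B ISA A, with A and B abstract.
isa = {"C": {"B"}, "B": {"A"}}
forced = is_forced_subtype(isa, {"A", "B"}, "C", "A")
```

In this chain every entity of type A must ultimately be of type C, so `forced` is true; making A concrete, or adding a second minimal subtype, makes the check fail.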

Forced subtyping is important for reachability because cycles of mandatory 
(and possibly injective) inherited relationships can usually be broken by inserting 
suitable entities, as we pointed out in our last example, but this does not happen 
if inheritance (on the non-mandatory side of the relationship type) is by forced 
subtypes. 

Nonetheless, forced subtyping is a pathological condition that should not
appear in a schema, just like abstract entity types without subtypes. Both
pathologies can be filtered before performing the acyclicity check; under the
hypothesis that no forced subtype exists, one can prove the following one-sided
theorem, using the same techniques as Theorems 1 and 2:

Theorem 5. If all instances of a schema $\mathcal{S}$ without forced subtypes are mutually
reachable, then $\Gamma(\mathcal{S})$ is acyclic. If all everywhere nonempty instances are
mutually reachable, then $\Delta(\mathcal{S})$ is acyclic.

10 Conclusions 

We have presented the first sound and complete algorithms to check instance
reachability in entity-relationship schemata. Since the algorithms are acyclicity
tests on a graph whose size is linearly bounded by the schema size, they run in
linear time.

It would be interesting to extend these results to schemata with ISA arcs 
and a type system as described in Definition 2, or, at least, for type constructors 
with disjunctive types (albeit the former definition is more general). It would be 
more cautious, however, to start first with an extension in this direction of the 
results given in [5], as the problem seems to be already tough enough in that 
case.

Abstract. The ontological analysis of conceptual modelling techniques is of
increasing popularity. Related research has explored the ontological deficiencies
not only of classical techniques such as ER or UML, but also of business process
modelling techniques such as ARIS or even Web services standards such as
BPEL4WS. While the selected ontologies are reasonably mature, it is the actual
process of an ontological analysis that still lacks rigor. The current procedure
leaves significant room for individual interpretations and is one reason for
criticism of the entire ontological analysis. This paper proposes a procedural model
for the ontological analysis based on the use of meta models, the involvement
of more than one coder and metrics. This model is explained with examples
from various ontological analyses.



1 Popularity of Ontological Analyses 

As techniques for conceptual modelling, enterprise modelling, and business process 
modelling have proliferated over the years (e.g., [12]), researchers and practitioners 
alike have attempted to determine objective bases on which to compare, evaluate, and 
determine when to use these different techniques (e.g., [4, 11]). Throughout the 80's, 
90's, and into the new millennium however, it has become increasingly apparent to 
many researchers that without a theoretical foundation on which to base the specifica- 
tion for these various modelling techniques, incomplete evaluative frameworks of 
factors, features, and facets would continue to proliferate. Furthermore, without a 
theoretical foundation, one framework of factors, features, or facets is as justifiable as 
another for use (e.g., [2]). 

Wand and Weber [19-23] have investigated the branch of philosophy known as on- 
tology as a foundation for understanding the process in developing an information 
system. Ontology is a well-established theoretical domain within philosophy dealing 
with identifying and understanding elements of the real world. Today however, inter- 
est in, and the applicability of, ontologies extends to areas far beyond modelling. As 
Gruninger and Lee [9, p. 1 3] point out, “...a Web search engine will return over
64,000 pages given “ontology” as a key word... the first few pages are phrases such as
“enabling virtual business”, “gene ontology consortium”, and “enterprise ontology”.”
The usefulness of ontology as a theoretical foundation for knowledge representation
and natural language processing is a fervently debated topic at the present time in the
artificial intelligence research community [10]. The popularity of using ontologies as
a basis for the analysis of techniques that purport to assist analysts to develop models 
that emulate portions of the real world has been growing steadily. The Bunge-Wand- 
Weber (BWW) ontological models [24], for example, have been applied extensively 
in the context of the analysis of various modelling techniques. Wand and Weber [19- 
23] and Weber [24] have applied the BWW representation model to the “classical” de- 
scriptions of entity-relationship (ER) modelling and logical data flow diagramming 
(LDFD). Weber and Zhang [25] also examined the Nijssen Information Analysis Method 
(NIAM) using the ontology. Green [5] extended the work of Weber and Zhang [25] and 
Wand and Weber [22-23] by analysing various modelling techniques as they have been 
extended and implemented in upper CASE tools. Furthermore, Parsons and Wand [15] 
proposed an initial model of objects and they use the ontological models to identify rep- 
resentation-oriented characteristics of objects. Along similar lines, Opdahl and Hender- 
son-Sellers [13] have used the BWW representation model to examine the individual 
modelling constructs within the OPEN Modelling Language (OML) version 1.1 based 
on “conventional” object-oriented constructs. Green and Rosemann [6] have extended 
the analytical work into the area of integrated process modelling based on the
techniques presented in Scheer [17]. Most recently, Green et al. [8] have extended the
use of this evaluative base into the area of enterprise systems interoperability using
business process modelling languages like ebXML, BPML, BPEL4WS, and WSCI.

Clearly, ontology is a fruitful theoretical basis on which to perform such analyses. 
However, while ontological analyses are frequently utilised, particularly in the area of 
conceptual modelling technique analysis, the actual process of performing the analy- 
sis remains problematic. The current process of ontological analysis is open to the 
individual interpretations of the researchers who undertake the analysis.
Consequently, such analyses are criticised as being subjective, ad hoc, and lacking in
relevance. There is a need, therefore, for the systematic identification of shortcomings of
the current ontological analysis process. The identification of such weaknesses, and 
their subsequent mitigation, will lead to a more rigorous, objective, and replicable 
analytical process. 

Accordingly, this paper has several objectives. First, we aim to identify compre- 
hensively the shortcomings in the current practice of ontological analysis. The identi- 
fication of such shortcomings will provide a basis upon which the practice of onto- 
logical analysis can be improved. Second, we want to develop several propositions 
and methodology extensions that enhance the ontological analysis process by making 
it more objective and structured. 

There are several contributions that this paper aims to make. They are based on 
previous experiences with ontological analyses as well as observations derived from 
published analyses. First, the work presents a detailed analysis of the actual process of 
performing an ontological evaluation. The presented work identifies eight shortcom- 
ings of the current ontological analysis process, viz., lack of understandability, lack of 
comparability, lack of completeness, lack of guidance, lack of objectivity, lack of 
adequate result representation, lack of result classification, lack of relevance. Each of 
the identified shortcomings is classified then as belonging to one of three phases of 
analysis, viz., input, process, and output. Second, the paper presents recommendations 
on how each of the shortcomings in the three phases can be overcome. The recom- 
mendations, inter alia, include an extended methodology for the improvement of the 
objectivity of the analysis as well as a weighting model that aims to improve the
classification of the results of any ontological analysis.

The remainder of this paper is structured as follows. The next section identifies
eight current shortcomings of ontological analyses that are classified with respect to 
the three phases of analysis, viz., input, process and output. The third section provides 
recommendations concerning how to overcome the identified shortcomings in each of 
the three phases. The final section provides a brief summary of this work and outlines 
future research in this area. 



2 Shortcomings of Current Ontological Analyses 

An ontological analysis is in principle the evaluation of a selected modelling grammar 
from the viewpoint of a pre-defined and well-established ontology. The current focus 
of ontological analyses is on the bi-directional comparison of ontological constructs 
with the elements of the modelling grammar that is under analysis. Weber [24] clari- 
fies two major situations that may occur when a grammar is analysed according to an 
ontology. After a particular grammar has been analysed, predictions on the modelling 
strengths and weaknesses of the grammar can be made according to whether some or 
any of the following situations arise out of the analysis. 

1. Ontological Incompleteness (or Construct Deficit) exists unless there is at least 
one grammatical construct for each ontological construct. 

2. Ontological Clarity is determined by the extent to which the grammar does not 
exhibit one or more of the following deficiencies: 

• Construct Overload exists in a grammar if one grammatical construct represents 
more than one ontological construct. 

• Construct Redundancy exists if more than one grammatical construct represents 
the same ontological construct. 

• Construct Excess exists in a grammar when a grammatical construct is present 
that does not map to any ontological construct. 
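
The four situations can be read directly off a mapping between ontological and grammatical constructs. The following Python sketch is purely illustrative: the function name `classify_mapping`, the representation of a mapping as a set of pairs, and the toy construct names are assumptions made for this example and are not part of Weber's formulation.

```python
from collections import defaultdict


def classify_mapping(ontology, grammar, mapping):
    """Classify a construct mapping into the four situations described above.

    ontology: set of ontological construct names
    grammar:  set of grammatical construct names
    mapping:  set of (ontological_construct, grammatical_construct) pairs
    """
    by_ontological = defaultdict(set)
    by_grammatical = defaultdict(set)
    for onto, gram in mapping:
        by_ontological[onto].add(gram)
        by_grammatical[gram].add(onto)

    return {
        # ontological constructs with no grammatical counterpart
        "deficit": {o for o in ontology if not by_ontological[o]},
        # grammatical constructs that represent several ontological constructs
        "overload": {g for g in grammar if len(by_grammatical[g]) > 1},
        # ontological constructs represented by several grammatical constructs
        "redundancy": {o for o in ontology if len(by_ontological[o]) > 1},
        # grammatical constructs with no ontological counterpart
        "excess": {g for g in grammar if not by_grammatical[g]},
    }


# Toy example with hypothetical construct names.
ontology = {"thing", "property", "state", "event"}
grammar = {"entity", "attribute", "relationship", "note", "object"}
mapping = {
    ("thing", "entity"),
    ("thing", "object"),
    ("property", "attribute"),
    ("property", "relationship"),
    ("state", "attribute"),
}
result = classify_mapping(ontology, grammar, mapping)
```

In this toy example, event has no counterpart (deficit), attribute stands for both property and state (overload), thing and property are each represented twice (redundancy), and note maps to nothing (excess).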

Though this type of ontological analysis is widely established, it still has a range 
of shortcomings. These shortcomings can be categorised into the three main phases of 
an ontological analysis, i.e. preparation of the input data, the process of conducting 
the analysis, and the evaluation and interpretation of the results. 

The first two identified shortcomings refer to the quality of the input data. 

2.1 Lack of Understandability 

Most of the ontologies that are currently used for analysis of modelling grammars 
have been specified in formal languages. While such a formalisation is beneficial for 
a complete and precise specification of the ontology, it is not naturally a very intuitive 
specification. An ontology that is not clear and intuitive can lead to misinterpretations 
as the involved stakeholders have problems with the specifications. Furthermore, it 
forms a hurdle for the application of the ontology as it requires a deep understanding 
of the formal language in which it is specified. 

2.2 Lack of Comparability 

The specification of an ontology typically requires a formal syntax, which allows the 
precise specification of the elements and relationships of the ontology. Such 
specifications are required, but not necessarily intuitive. Consequently, textual descriptions 
of the ontology in ‘plain English’ often extend the formal specification. 

However, even if an ontology is specified in an intuitive and understandable lan- 
guage, the actual comparison with the selected modelling grammar remains a prob- 
lem. Unless the ontology and the grammar are specified in the same language, it will 
be up to the coder to ‘mentally convert’ the two specifications into each other, which 
adds a subjective element to the analysis. Different languages can also lead to differ- 
ent levels of detail and further complicate the analysis. In any case, they make a more 
automated comparison practically impossible. This is the typical situation in nearly all 
previous analyses. 

The further three shortcomings identified below are related to the process of the 
ontological analysis and refer to what should be analysed, how it should be analysed 
as well as who should conduct the analysis. 

2.3 Lack of Completeness 

The first decision that has to be made in the process of an ontological analysis is on 
the scope and depth of the analysis. Even if most ontologies have been discussed for 
many decades they still undergo modifications and extensions. It is up to the re- 
searcher to clearly specify the selected version of the ontology and the scope and level 
of detail of the analysis. In our work in the area of Web Services, for example, it was 
often not clear what constructs form the core of the standard. Two researchers who 
conducted independent analyses of the same Web Services standard consequently 
selected different numbers of constructs. 

Moreover, many ontological analyses solely focus on the constructs of the ontol- 
ogy and the constructs of the grammar but do not sufficiently consider the relation- 
ships between these constructs. The difficulty in clearly specifying the boundaries of 
the analysis as well as the limited consideration of relationships between the ontologi- 
cal constructs lead to a lack of completeness. 

2.4 Lack of Guidance 

After the scope and the level of detail of the analysis have been specified, it is typi- 
cally up to the coder to decide on the procedure of the analysis, i.e. in what sequence 
will the ontological constructs and relationships be analysed? Currently, there are 
hardly any recommendations on where to start the analysis. This lack of procedural 
clarity underlies most analyses and has two consequences. First, a novice analyst 
lacks guidance in the process of conducting the ontological evaluation. Second, the 
procedure of the analysis can potentially have an impact on the results of the analysis. 
Thus, it is possible that two analyses that follow a different process may lead to dif- 
ferent outcomes. 

2.5 Lack of Objectivity 

An ontological analysis of a grammar requires not only detailed knowledge of the 
selected ontology and grammar but also a good understanding of the languages in 
which the ontology and the grammar are specified. This requirement explains why 
most analyses are carried out by single researchers as opposed to research teams. 
Consequently, these analyses are based on the individual interpretations of the 
involved researcher, which adds significant subjectivity to the results. This problem is 
further compounded by the fact that, unlike other qualitative research projects, onto- 
logical analyses typically do not include attempts to further validate the results. 

The five shortcomings identified above have a common flavour in that they heavily 
depend on the researcher conducting the ontological evaluation. Three further 
shortcomings have been identified, viz., lack of adequate result representation, lack of 
result classification and lack of relevance. These shortcomings are detailed below and refer to the 
outcomes of the analysis. 

2.6 Lack of Adequate Result Representation 

The results of a complete ontological analysis, i.e. representation mapping and inter- 
pretation mapping, are typically summarised in two tables. These tables list all onto- 
logical constructs (first table) and all grammatical constructs (second table) and the 
corresponding constructs of the other meta model. Such tables can become quite 
lengthy and are typically not sorted in any particular order. They don’t provide any 
insights into the importance of identified deficiencies and they also don’t cluster the 
findings. 

2.7 Lack of Result Classification 

As indicated above, it is common practice to derive ontological deficiencies based on 
a comparison of the constructs in the ontology and the grammar. Ontological weak- 
nesses are identified when corresponding constructs are missing in the obtained map- 
ping between the ontology and the grammar or 1-many (or many-1 or even many- 
many) relationships exist. Such identified deficiencies are the typical starting point for 
the derivation of propositions and then hypotheses. In general, the ontological analy- 
sis does not make any statements regarding the relative importance of these findings 
in comparison with each other. Though this seems to be the established practice, it 
lacks more detailed insights into the significance of the results. It is expected, how- 
ever, that the missing support for a core construct of an ontology can be rated higher 
than a missing corresponding construct for a minor ontological construct or a 
relationship. This lack of a more detailed statement regarding the significance of a potential 
shortcoming makes it difficult to quickly compare the outcomes of two different 
analyses, e.g. an ontological analysis of ARIS in comparison with an 
ontological analysis of UML. 

2.8 Lack of Relevance 

Finally, the results of an ontological analysis should be perceived as relevant by the 
related stakeholders. However, if an ontological analysis leads, for example, to the 
outcome that Entity Relationship Models do not support the description of behaviour, 
then it is not surprising that the IS community develops a rather critical opinion. It 
seems that an ontological analysis has to consider the purpose of the grammar as well 
as the background of the modeller who is applying this grammar. The application of a 
high-level and generic ontology does not consider this individual context and there is 
a danger that the outcomes can be perceived as trivial. 



3 Reference Methodology for Conducting Ontological Analyses 

The above identified shortcomings motivated the development of an enhanced meth- 
odology for ontological analyses. The main purpose of this methodology is to increase 
the rigour, the overall objectivity and the level of detail of the analysis. The proposed 
methodology for ontological analyses is structured in three phases, viz., input, process 
and output. 

3.1 Input 

The formal specification of ontologies, together with the differences in the languages 
used to specify the ontologies and the grammars under analysis, have been classified 
as issues pertaining to the lack of understandability and comparability. 

In order to overcome these shortcomings, it is proposed to convert the ontology as 
well as the selected modelling grammar to meta models using the same language (e.g. 
ER Models or UML Class Diagram). This facilitates a pattern-matching approach 
towards the ontological analyses of completeness and clarity of a grammar. As a first 
step we converted, for example, the Bunge-Wand-Weber ontology into an ER-based 
meta model. This meta model includes 50 entity types and 92 relationship types. It 
has clusters such as system, property or class/kind. Such a meta model explains, in a 
language familiar to the Information Systems (IS) community, the core constructs of 
the ontology. It also highlights the underlying focus of the ontology. In the case of the 
BWW model, for example, it is obvious from the visual inspection of the meta model 
that the ontology is centred around the existence of a thing , which is the central entity 
type in the meta model. 

The obtained meta model can now be used for a variety of ontological analyses. 
Moreover, it allows a critical review of the BWW model by a wider community. The 
approach, however, is not without its limitations. Commonly used modelling 
techniques such as ER or UML are widely accepted, but they have not been 
designed for the purposes of meta modelling. Thus, they occasionally lack the 
required expressiveness. Fig. 1 provides an impression of the size and complexity of the 
meta model for the BWW ontology. 

While an ER-based meta model helps to overcome issues related to the under- 
standability of an ontology, a corresponding meta model of the analysed grammar is 
required to deal with the lack of comparability issue. Many popular modelling tech- 
niques (e.g. ARIS or UML, and also interoperability standards such as ebXML) are 
already specified in meta models using ER-notations or UML Class Diagrams. If the 
meta models for the ontology and the modelling technique are specified in the same 
language, the ontological analysis turns into a comparison of two conceptual models. 
As part of the analysis, it will be required to identify corresponding entity types and 
relationship types in both models. It also becomes immediately obvious if the 
paradigm of the analysed grammar differs from the ontology. In the case of ARIS or many 
web services standards, for example, the meta models are centred around functions or 
activities instead of being centred around things. 

3.2 Process 

Issues related to the process of conducting an ontological analysis have been de- 
scribed as lack of completeness, lack of guidance and lack of objectivity. 

Fig. 1. The BWW meta model 



Based on the assumption that corresponding meta models for the ontology and the 
analysed grammar are available, it is possible to clearly specify the scope of an 
analysis using those meta models. Such a selection of clusters, entity types and relationship 
types would define all elements that are considered relevant for a complete 
analysis. An analysis of an ER-based notation, for example, could be focused on the 
BWW clusters thing, system and property and could exclude the more behavioural- 
oriented clusters event and state. Such boundaries of an analysis could be easily visu- 
alised in the meta model and would provide a clear description of the comprehensive- 
ness of the analysis. 

The existence of two corresponding meta models and a clear definition of the 
scope of the analysis are necessary but not sufficient criteria for a well-guided 
process. Further guidelines are required regarding the starting point of such a process and 
the actual sequence of activities. Based on our experiences, we recommend starting 
with the representation mapping, i.e. selecting the meta model of the ontology and 
subsequently identifying the corresponding elements in the modelling grammar. The 
first construct to be analysed should be the most central entity type, i.e. in the case of 
the BWW models the entity type thing. Our previous work provides a strong 
argument that this analysis should be followed by a cluster-by-cluster approach. Starting with the 
core constructs in a cluster, this allows a more structured and focused analysis of the 
completeness of a modelling grammar. The analysis of the entity types is followed by 
the relationships and the cardinalities. Constructs in the meta model that only have 
been introduced for the correctness of the meta model, but that do not reflect onto- 
logical constructs are excluded from the analysis. The representation mapping is fol- 
lowed by an analysis of the clarity, i.e. the interpretation mapping. In this case the 
meta model of the grammar under analysis is the starting point. The general procedure 
is similar. A main advantage of a cluster-based analysis is that the structure of the two 
meta models provides valuable input for the ontological analysis. An example is the 
analysis of generalisation-specialisation relationships in the meta model of the 
grammar. We propose to ontologically classify the super-type first and then to let all 
sub-types inherit this ontological classification. This streamlines the process of the 
analysis and increases its consistency. 
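
The inheritance of an ontological classification along generalisation-specialisation relationships can be sketched as follows. This is a minimal Python illustration, not the procedure used in the analyses reported here: the function name, the dictionary layout and the construct names are hypothetical, and the rule that an explicit classification of a sub-type takes precedence over the inherited one is an assumption added for the example.

```python
def inherit_classification(supertype_of, classified):
    """Propagate ontological classifications from super-types to sub-types.

    supertype_of: dict mapping each sub-type to its direct super-type
    classified:   dict mapping explicitly classified grammar constructs
                  to an ontological construct
    Returns a new dict in which every construct without an explicit
    classification inherits the one of its nearest classified ancestor.
    """
    result = dict(classified)

    def resolve(construct, seen=()):
        if construct in result:
            return result[construct]
        parent = supertype_of.get(construct)
        if parent is None or parent in seen:
            return None  # no classified ancestor (or a cycle in the hierarchy)
        found = resolve(parent, seen + (construct,))
        if found is not None:
            result[construct] = found
        return found

    for construct in supertype_of:
        resolve(construct)
    return result


# Hypothetical fragment of a grammar meta model.
supertype_of = {
    "function": "activity",
    "manual function": "function",
    "document": "information carrier",
}
classified = {"activity": "transformation"}  # hypothetical classification
resolved = inherit_classification(supertype_of, classified)
```

Here function and manual function inherit the classification of activity, whereas document remains unclassified because none of its ancestors has been classified yet.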

The lack of objectivity issue, on the other hand, frequently stems from the analysis 
being performed by a single researcher. The situation results in an analysis that is 
almost certainly biased by the researcher’s background as well as their interpretation 
of the specification of the grammar. In order to improve the validity of the analysis, a 
research methodology can be adopted that undertakes individual analyses of a particu- 
lar grammar by at least two members of a research team, followed by consensus as to 
the final analysis by the entire team of researchers. The methodology consists of three 
steps: 

Step 1: Using the specification of the grammar in question, at least two researchers 
separately read the specification and interpret, select and map the ontological 
constructs to candidate grammatical constructs to create individual first drafts of 
the analysis. 

Step 2: The researchers involved in Step 1 of the methodology meet to discuss and 
defend their interpretations of the representation modelling analysis. This meet- 
ing leads to an agreed second draft version of the analysis that incorporates ele- 
ments of each of the researchers’ first draft analyses. The overlap in the selection 
of the constructs and in the actual ontological analysis can be quantified by vari- 
ous figures that are used in content analysis and other more qualitative research. 

Step 3: The second draft version of the analysis for each of the interoperability can- 
didate standards is used as a basis for defence and discussion in a meeting in- 
volving the entire research team. The outcome of this meeting forms the final 
analysis of the grammar in question. 

Such a methodology was employed in a project that sought to apply the BWW 
representation model analysis to a number of the leading potential Web Service 
standards, viz., ebXML, BPML, BPEL4WS and WSCI. The project team was composed 
of four researchers and the standards were analysed in the order: ebXML -> BPML 
-> BPEL4WS -> WSCI. Two researchers were involved in steps 1 and 2 of the meth- 
odology, i.e. the individual analysis of a standard followed by a meeting of the two 
researchers in order to obtain an agreed mapping. This was followed by a meeting of 
the entire team in order to discuss the mapping and arrive at the final analysis. The 
process was performed for each of the four standards. 

Table 1 shows the recorded agreement statistics at the second step of the applied 
methodology while Table 2 shows the recorded agreement statistics at the third step 
of the methodology. 



Table 1. Summary of Step 2 mapping agreement between both researchers 

Web Service   Construct Mapping      Total number of         Mapping 
Language      agreed upon by both    specification           conference 
              researchers            constructs identified 

ebXML         43                     51                      84% 
BPML          36                     46                      78% 
BPEL4WS       30                     47                      63% 
WSCI          39                     49                      79% 

Table 2. Summary of Step 3 mapping agreement 

Web Service   Construct Mapping      Total number of         Mapping 
Language      agreed upon by the     specification           conference 
              team                   constructs identified 

ebXML         49                     51                      96% 
BPML          41                     46                      89% 
BPEL4WS       42                     47                      89% 
WSCI          46                     49                      94% 



The adoption of such a methodology is seen to have greatly improved the 
objectivity of the analyses carried out. 
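
The percentages in Tables 1 and 2 appear to be the ratio of agreed construct mappings to the total number of constructs identified. The short Python sketch below recomputes these ratios from the counts in the tables; the reading of the last column as this ratio is an assumption, as are all names in the code. The reported percentages agree with the recomputed ratios to within one percentage point.

```python
# (agreed mappings, total constructs identified) per standard, from Tables 1 and 2
STEP_2 = {"ebXML": (43, 51), "BPML": (36, 46), "BPEL4WS": (30, 47), "WSCI": (39, 49)}
STEP_3 = {"ebXML": (49, 51), "BPML": (41, 46), "BPEL4WS": (42, 47), "WSCI": (46, 49)}


def agreement_ratio(agreed, total):
    """Share of identified constructs whose mapping was agreed upon."""
    if total <= 0:
        raise ValueError("total must be positive")
    return agreed / total


def agreement_table(step):
    """Compute the agreement ratio for every standard of one step."""
    return {name: agreement_ratio(a, t) for name, (a, t) in step.items()}


step_2 = agreement_table(STEP_2)
step_3 = agreement_table(STEP_3)

# Increase in agreement between the two-researcher and the team meeting.
gains = {name: step_3[name] - step_2[name] for name in step_2}

for name in step_2:
    print(f"{name}: {step_2[name]:.1%} -> {step_3[name]:.1%}")
```

Under this reading, the agreement increases from Step 2 to Step 3 for all four standards, with the largest increase for BPEL4WS.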

3.3 Output 

The three main shortcomings related to the outcome of an ontological analysis have 
been characterised as the lack of adequate result representation, lack of result classifi- 
cation and the lack of relevance. 

The meta models, which have been used as input for the ontological analyses, are 
an appropriate medium to visualise the outcomes of the entire analysis process. In our 
work on the analysis of ARIS, we derived a meta model of the BWW model that 
highlighted all constructs of the ontology that do not have a corresponding construct 
in the grammar under analysis, i.e. we visualised incompleteness in the model using 
simple colour coding. In a similar way, we derived three ARIS meta models that high- 
lighted excess, overload and redundancy in ARIS. Such models form a very intuitive 
way of representing the identified ontological shortcomings. The underlying cluster- 
ing of the models also helps to quickly comprehend the main areas of shortcomings. 

At present, the process of an ontological analysis results in the identification 
of ontological incompleteness and ontological clarity through the identification of 
missing, overloaded or redundant grammatical constructs. While the end result identi- 
fies such problems, it fails to account for their relative importance. For example, thing 
is one of the fundamental constructs of the BWW model. The lack of mapping for the 
construct should, therefore, be considered more important than the lack of mapping 
for the well-defined event construct for example. There is a need for the development 
of a scoring model that enables the calculation of the ‘goodness’ of a grammar with 
respect to the ontology. In such a scoring model, each of the ontological constructs 
has a value assigned to it that reflects the relative importance of the construct in the 
ontology. Core constructs would therefore have high weightings whereas less impor- 
tant constructs would attract lower values of weightings. Following an ontological 
analysis of a particular grammar, the weighting of all missing constructs would be 
calculated to arrive at one value that generally reflects the outcome of the analysis. 

An example for such a classification could have the following structure. All core 
constructs of an ontology (and the modelling grammar) would get the value 1. All 
other constructs represented as an entity type in the meta model of the ontology would 
receive the value 0.7, and all other constructs get the value 0.3. Such a weighting 
would then be applied to the outcomes of the ontological analysis. The scores would 
be aggregated across the ontology and modelling grammar. They also would be calcu- 
lated separately for completeness, excess, overload and redundancy. Furthermore, 
they could be aggregated per cluster, which allows a more differentiated view on the 
particular strengths of a modelling grammar. Though the consolidated score of such 
an evaluation should not be overrated, it provides better insights into the 
characteristics of the ontological deficiencies and provides a first rating of the significance and 
importance of the identified shortcomings. 
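
A scoring model of this kind could be sketched as follows. The weights 1, 0.7 and 0.3 are taken from the example above; everything else in this Python fragment (the function name, the data layout, and in particular the assignment of constructs to clusters and weight categories) is hypothetical and serves only to illustrate the aggregation in total and per cluster.

```python
WEIGHTS = {"core": 1.0, "entity": 0.7, "other": 0.3}  # values from the example above


def weighted_score(constructs, affected):
    """Sum the weights of the affected constructs, in total and per cluster.

    constructs: dict name -> (cluster, category), category being a key of WEIGHTS
    affected:   iterable of construct names flagged by the analysis
                (e.g. all ontological constructs without a mapping)
    """
    per_cluster = {}
    for name in affected:
        cluster, category = constructs[name]
        per_cluster[cluster] = per_cluster.get(cluster, 0.0) + WEIGHTS[category]
    return sum(per_cluster.values()), per_cluster


# Hypothetical categorisation of a few ontological constructs.
constructs = {
    "thing": ("thing", "core"),
    "property": ("property", "core"),
    "conceivable state space": ("state", "entity"),
    "lawful event space": ("event", "entity"),
    "coupling": ("system", "other"),
}
missing = ["conceivable state space", "lawful event space", "coupling"]
total, per_cluster = weighted_score(constructs, missing)
```

The same function can be applied separately to the constructs flagged for incompleteness, excess, overload and redundancy, which yields one score per type of deficiency.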

Apart from the lack of result classification that is addressed by the scoring model, 
another problem with the outcome of the analyses has been the perceived lack of 
relevance. Since most modelling grammars focus on modelling a sub-set of the 
phenomena that occur in the real world, it would follow that not all constructs of an 
ontology are necessary in order to analyse such a grammar. If the full ontology is used 
in the analysis, the result may identify potential problems that would not, in reality, 
occur, because the modelling grammar is not used to model any phenomena described 
by the missing constructs. Further, there may also be a need for specialisation of some 
of the ontological constructs in order to enhance analysis of a grammar pertaining to a 
particular domain. The concept of a focused ontology is shown in Fig. 2. 

Indeed, the outcomes of the ontological analyses of different modelling grammars 
to date appear to support the need for a focused ontology, which consists of different 
subsets of the ontological constructs for different domains. The analyses of the exam- 
ined grammars consistently show that the constructs conceivable state space, con- 
ceivable event space and lawful event space, for example, have no representative 
constructs in the grammars. Such missing constructs, if identified to be unnecessary 
for the particular domain, can be ignored leading to a simpler analysis that does not 
consider phenomena that are deemed to be outside of the scope of the domain. 
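
The effect of a focused ontology on the identified construct deficit can be illustrated with a few lines of Python. The sketch assumes that the three constructs named above have been judged unnecessary for the domain at hand; this judgement, the function names and the set of mapped constructs are assumptions made for the example only.

```python
def focus_ontology(ontology, out_of_scope):
    """Return the subset of ontological constructs relevant to a domain."""
    return set(ontology) - set(out_of_scope)


def construct_deficit(ontology, mapped):
    """Ontological constructs without any corresponding grammatical construct."""
    return set(ontology) - set(mapped)


ontology = {
    "thing", "property", "state", "event",
    "conceivable state space", "conceivable event space", "lawful event space",
}
# Hypothetical outcome of a representation mapping.
mapped = {"thing", "property", "state", "event"}
# Constructs assumed to be outside the scope of the domain.
out_of_scope = {
    "conceivable state space", "conceivable event space", "lawful event space",
}

full_deficit = construct_deficit(ontology, mapped)
focused_deficit = construct_deficit(focus_ontology(ontology, out_of_scope), mapped)
```

With the full ontology the analysis reports three missing constructs; with the focused ontology it reports none, because the remaining constructs all have a counterpart in the grammar.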

[Figure: Chosen Ontology, Focused Ontology, Modelling Grammar] 

Fig. 2. An extension of ontological analysis through the use of focused ontologies 



4 Summary and Future Work 

There has been a marked increase in the popularity of the application of ontologies for 
the purposes of modelling grammar analysis. For example, a literature review 
identified more than 25 papers that applied the Bunge-Wand-Weber ontology for the 
analysis of modelling grammars such as ER (e.g., [19, 22-23]), OMT, UML (e.g., [3, 14, 
18]), Petri-Nets, ARIS (e.g., [6-7, 16]) or Web Services standards such as ebXML, 
BPEL4WS, BPML or WSCI (e.g., [1, 26, 8]). In general, selected ontologies and their 
interpretations, from an Information Systems viewpoint, are reasonably advanced. 
However, the actual process of conducting an ontological analysis is still rather 
immature. At this stage, the process is focused on the identification of the cardinality of 
the relationships between corresponding elements in the ontology and the modelling 
grammar under analysis. 

In total, eight shortcomings of the current process of ontological analysis have 
been identified and categorised into issues related to the input, process and output of 
the analysis. 

This paper proposed a methodology that further enhances the current process of 
ontological analysis. The objectives of this methodology are 

— to provide guidance for researchers who are interested in conducting ontological 
analyses, 

— to add rigour to the entire process and reduce the dependence on the subjective 
interpretations of the involved researcher, and 

— to increase the overall credibility of the ontological analysis. 

Examples from our ontological analyses of ARIS and various Web Services stan- 
dards have been used to exemplify this methodology. As a consequence, we hope that 
the presented more rigorous process will increase the overall acceptance of using 
ontologies for the analysis, comparison and engineering of various grammars. 



Abstract. In the past, most conceptual schemas of information systems have 
been developed essentially from scratch. Currently, however, several research 
projects are considering an emerging approach that tries to reuse as much as 
possible the knowledge included in existing ontologies. Using this approach, 
conceptual schemas would be developed as refinements of (more general) 
ontologies. However, when the refined ontology is large, a new problem that 
arises with this approach is the need to prune the concepts in that ontology 
that are superfluous in the final conceptual schema. This paper proposes a new 
method for pruning ontologies in this approach. We show the advantages of our 
method with respect to similar pruning methods developed in other contexts. 
Our method is general and can be adapted to most conceptual modeling 
languages. We give the complete details of its adaptation to the UML. Moreover, 
the method is fully automatic and has been implemented. We 
illustrate the method by means of its application to a case study that refines the 
Cyc ontology. 



1 Introduction 

Most conceptual schemas of information systems have been developed essentially 
from scratch. The current situation is not very different: most industrial information 
systems projects are being developed using a methodology that assumes that the con- 
ceptual schema is created every time from scratch. However, it is well-known that 
substantial parts of conceptual schemas can be reused in different projects, and that 
such reuse may increase the conceptual schema quality and the development produc- 
tivity [13]. 

Several research projects explore alternative approaches that try to reuse 
conceptual schemas as much as possible [3, 12, 21, 22]. The objective is similar to that of 
projects in the artificial intelligence field that try to reuse ontologies. There are 
several definitions of the term “ontology”. We adopt here the one proposed in [7, 24], in 
which an ontology is defined as the explicit representation of a conceptualization. A 
conceptualization is the set of concepts (entities, attributes, processes) used to view a 
domain. An ontology is the specification of a conceptualization in some language. In 
this paper, we consider a conceptual schema as the ontology an information system 
needs to know. 

Ontologies can be classified in terms of their level of generality into [8]: 

— Top-level ontologies, which describe domain-independent concepts such as space, 
time, etc. 

— Domain and task ontologies, which describe, respectively, the vocabulary related 
to a generic domain and a generic task. 

— Application ontologies, which describe concepts depending on a particular domain 
and task. 

We call top-level, domain and task ontologies general ontologies. One example of 
a general ontology is Cyc [11]. 

General ontologies can play several roles in conceptual modeling [22]. One of 
them is the base role. We say that a general ontology plays a base role when it is the 
basis from which the conceptual schema is developed. In general, the development 
requires three main activities: refinement, pruning and refactoring [5], which are 
reviewed in the next section. The objective of the refinement activity is to extend the 
base ontology with the particular concepts needed in a conceptual schema, and that 
are not defined in that ontology. 

In general, when the base ontology is large, the extended ontology cannot be ac- 
cepted as the final conceptual schema because it includes many superfluous concepts. 
The objective of the pruning activity is then to prune the unnecessary concepts. In this 
paper, we propose a new method for pruning ontologies in the development of con- 
ceptual schemas. To the best of our knowledge, ours is the first method that is inde- 
pendent of the conceptual modeling language used and of the base ontology. The 
method can be used in other contexts as well, and we will show that it has several 
advantages over similar existing methods ([23, 20]). Our method can be adapted to 
most languages, and we give the complete details of its adaptation to the UML [17]. 
We illustrate the method by means of its application to a case study that refines the 
Cyc ontology. The case study deals with the directory of an organization (depart- 
ments, persons, assignment of persons to departments, contact locations, etc.). The 
complete details of the case study are reported in [4]. 

The structure of the paper is as follows. In the next section we review the three 
main activities in the development of a conceptual schema from a base ontology, with 
the objective of defining the context of the pruning activity, the focus of this paper. 
Section 3 presents the pruning method we propose. Section 4 compares our method 
with similar ones. Finally, Section 5 gives the conclusions and points out future work. 



2 The Context 

In this section we briefly review the three activities required to develop a conceptual 
schema from a general ontology: refinement, pruning and refactoring. Normally, these 
activities will be performed sequentially (see Fig. 1), but an iterative development is 
also possible [5]. 

2.1 Refinement 

Normally, a general ontology O_G will not completely include the conceptual schema 
CS required by a particular information system. The objective of the refinement 
activity is then to obtain an extended ontology O_X such that: 

— O_X is an extension of O_G, and 

— O_X includes the conceptual schema CS. 

Fig. 1. The three activities in the development of conceptual schemas from general ontologies 



The refinement is performed by the designer. S/he analyzes the IS requirements, 
determines the knowledge the system needs to know to satisfy those requirements, 
checks whether such knowledge is already in O_G and, if not, makes the necessary 
extensions to O_G, thus obtaining O_X. 

In our case study, we adopted as general ontology O_G OpenCyc [18], the public 
version of the Cyc ontology. OpenCyc includes over 2900 entity types and over 1300 
relationship types. Even if these numbers are large (and even larger in other 
ontologies such as Cyc), it is likely that additional entity or relationship types may be 
needed for the CS of a particular IS. 

For example, our case study deals with an organization, its departments and the 
people working in them. A department is a sub-organization, part of the organization 
to which it belongs, and it performs some of the activity of that organization. The 
organizational structure is hierarchical, with some departments reporting directly to 
the organization itself, while others report to another department. However, the 
concept of Department does not exist¹ in OpenCyc, so we have to add it. We have added 
the entity type Department, as a subtype of the pre-existing Organization (an indirect 
subtype of Agent), the relationship type HasDepartments (see Figure 2), and the con- 
straint that the organizational structure is hierarchical. 

As another example, OpenCyc includes the relationship type HasWorkers between 
two Agents. The meaning is that an agent (worker) regularly works for the other 
(work). A person may be a worker of several agents. HasWorkers has a supertype 
(WorksWith) and a subtype (HasEmployees). The relationship type that best fits our 
needs is HasWorkers. However, in our case study, the participants are Person and 
Organization. We have then refined HasWorkers in Person so that the work must be 
an instance of Organization. On the other hand, in our case study, the participation of a 
person in HasWorkers is mandatory (multiplicity 1..* in work role, see Figure 2). 

The complete refinement of OpenCyc for the case study is described in [4]. In
summary, we have added one entity type (Department), one attribute (the attribute name
of Agent, shown in Figure 2) and two associations (one of them is HasDepartments in
Figure 2). We have also added one refinement of attributes, one of associations
(shown in Figure 2) and four general integrity constraints.



¹ In fact, it appears in the documentation, but it is not yet included in the public download.







Fig. 2. (Partial) refinement of OpenCyc in the case study 



2.2 Pruning 

Normally, an extended ontology O_X will contain many concepts that are irrelevant
for a particular information system. The objective of the pruning activity is then to
obtain a pruned ontology O_P such that:

— O_P is a subset of O_X, and

— O_P includes the conceptual schema CS, and

— the concepts in O_X but not in O_P would have an empty extension in the
information system (i.e. they are irrelevant).

In the case study, we find that the OpenCyc ontology contains thousands of concepts
irrelevant for organizational directory management, for example the entity and
relationship types dealing with chemistry. Our information system is not interested in
these concepts and, therefore, their extension in the information base would be empty.
The objective of the pruning activity is to remove such concepts from O_X. In the next
section we present a method for the automatic pruning of ontologies. The input of the
method is either the formal specification of the IS requirements (domain events,
queries) or the explicit definition of the concepts (entity and relationship types) of
interest.

2.3 Refactoring 

Normally, a pruned ontology O_P cannot be accepted as a final CS because it can be
improved in several respects. The objective of the refactoring activity is then to obtain
a conceptual schema CS that is externally equivalent to O_P yet improves its structure.
The purpose of ontology (or conceptual schema) refactoring is equivalent to that of








126 Jordi Conesa and Antoni Olive 



software refactoring [6]. The refactoring is performed by the designer, but important 
parts of the activity can be assisted or automated, provided that the IS requirements 
are formalized. Refactoring consists in the application of a number of refactoring 
operations to parts of an ontology. Many of the software refactoring operations can be 
adapted to conceptual modeling, but this will not be explored in this paper. 

3 Pruning the Extended Ontology 

In this section, we define the problem of pruning the extended ontology and we
propose a new method for its solution. The starting point of the pruning activity is an
extended ontology O_X and the functional requirements of the IS.

3.1 The Extended Ontology 

We assume that, in the general case, an ontology O_X consists of sets of the following
elements:

— Concepts. There are two kinds of concepts:

    — Entity types.

    — Relationship types. We will denote by R(p_1:E_1, ..., p_n:E_n) a relationship type R
      with participant entity types E_1, ..., E_n playing roles p_1, ..., p_n respectively.

— Generalization relationships between concepts. We denote by IsA(C_1, C_2) the
generalization relationship between concepts C_1 and C_2. IsA^+ will be the transitive
closure of IsA. We admit multiple specialization and multiple classification.

— Integrity constraints².
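
The generic structure above can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the authors' implementation: the class name `Ontology`, its fields and the method `ancestors` are introduced here, relationship types and constraints are left out, and the sample data makes Organization a direct subtype of Agent although the text describes it as an indirect one.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Toy container: concept names plus direct IsA(sub, super) pairs."""
    concepts: set[str] = field(default_factory=set)
    isa: set[tuple[str, str]] = field(default_factory=set)  # (sub, super)

    def ancestors(self, concept: str) -> set[str]:
        """Return every c such that IsA+(concept, c) holds (transitive closure)."""
        found: set[str] = set()
        stack = [concept]
        while stack:
            current = stack.pop()
            for sub, sup in self.isa:
                if sub == current and sup not in found:
                    found.add(sup)
                    stack.append(sup)
        return found


# Simplified sample data (assumption: the real OpenCyc hierarchy has more levels).
onto = Ontology(
    concepts={"Agent", "Organization", "Department", "Person"},
    isa={("Department", "Organization"), ("Organization", "Agent"), ("Person", "Agent")},
)
print(sorted(onto.ancestors("Department")))  # prints ['Agent', 'Organization']
```

Multiple specialization is supported naturally, since a concept may appear as the sub of several pairs.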

Adaptation to the UML. In the UML an ontology O_X consists of sets of the following
elements (see Figure 2):

— Concepts: 

— Entity types. 

— Data types. 

— Attributes. 

— N-ary associations. 

— Association classes, which reify associations. An association class and its 
reifying association are a single element. 

— Generalization relationships between the above concepts. Attributes cannot be
generalized.

— Constraints. 

In the UML, some constraints are predefined (they have a particular language con- 
struct) and others may be user-defined. In our method we deal with constraints of the 
following kinds: 

— Cardinality constraints of associations and attributes. 

— Completeness and disjointness of sets of generalizations. 



² The generalization relationships are also (inclusion) constraints, but we give them a special
treatment due to their prominent role in ontologies and in conceptual modeling.
— Redefinitions of association ends and attributes (refinement constraints). Figure 2
shows an example of association redefinition: the association HasWorkers is
redefined in Person.

— General constraints. We assume that general constraints are defined by constraint
operations and specified in the OCL, as explained in [16]. The adaptation of our
method to constraints defined as invariants is straightforward. An example is the
constraint that the name of agents must be unique. Its formal specification is:

    context Agent::uniqueName() : Boolean
    body: Agent.allInstances()->isUnique(name)

In the case study, O_X consists of:

— 2,697 Entity types and 255 Data types. 

— 266 Attributes and 1,446 Associations. 

— 4 general integrity constraints. 



3.2 Concepts of Direct Interest 

Usually, the extended ontology O_X will be (very) large, and only a (small) fraction of
it will be needed for the CS of a particular IS. The objective of the pruning activity, as
we will define it below, is to remove some unneeded elements from O_X.

The pruning activity needs to know which concepts from O_X are of direct interest
in the IS. A concept is of direct interest in a given IS if its users and designers are
interested in representing its population in the Information Base of the IS.

When the functional requirements of an IS are formally specified, then the concepts
of direct interest CoI may be automatically extracted from them [22]. The details
of the extraction process depend on the method and language used for that
specification. We explain here the process when the IS behavior is specified by system
operations, as is done in many methods such as Larman’s method [10], the B method [1] or
Fusion [2]. A similar process can be used when the behavior is specified by
statecharts, event operations or other equivalent methods.

In general, the formal specification of a system operation consists of: 

— A signature (name, parameters, and result). The types of the parameters and the
result are entity types defined in O_X.

— A set of preconditions. Each precondition is a boolean expression involving
concepts defined in O_X.

— A set of postconditions. As above, each postcondition is a boolean expression
involving concepts defined in O_X.

The concepts of direct interest CoI are then defined as:

— The types of the parameters and result of the system operations.

— The concepts appearing in the pre- or postconditions.

In some cases the formal specification may not be available or may be incomplete.
In these cases, the designers may wish to define the concepts CoI explicitly or to add
new concepts to those determined from the functional specification. Given that our
pruning method needs to know these concepts, independently of how they have been
obtained, we allow designers to define them explicitly, either in total or in part.
If a relationship type is a concept of direct interest then we require that its
participant entity types are in CoI also. Formally, we say that a set of concepts of direct
interest CoI is complete if for each relationship type
$R(p_1{:}E_1, \ldots, p_n{:}E_n) \in \mathit{CoI}$ the participant entity types satisfy
$\{E_1, \ldots, E_n\} \subseteq \mathit{CoI}$.

In O_X there may be some concepts that generalize those in CoI and which are not
part of CoI. We are interested in these generalized concepts because they may be
involved in constraints that affect instances of the concepts CoI. To this end, we define
the set of generalized concepts of interest G(CoI) as the concepts of a complete set CoI
together with their generalizations. Formally:

$$G(\mathit{CoI}) = \{\, c \mid c \in \mathit{CoI} \;\lor\; \exists\, \mathit{sub}\; (\mathit{IsA}^{+}(\mathit{sub}, c) \land \mathit{sub} \in \mathit{CoI}) \,\}$$
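
The completeness condition and the definition of G(CoI) translate directly into set operations. The sketch below is illustrative only: the function names, the `participants` dictionary and the sample data are assumptions made here (in particular, the participants of HasDepartments are assumed to be Organization and Department, and the hierarchy is simplified).

```python
def ancestors(isa, concept):
    """Return every c with IsA+(concept, c), following direct (sub, super) pairs."""
    found, stack = set(), [concept]
    while stack:
        current = stack.pop()
        for sub, sup in isa:
            if sub == current and sup not in found:
                found.add(sup)
                stack.append(sup)
    return found


def is_complete(coi, participants):
    """CoI is complete if the participants of each relationship type in it are in it too."""
    return all(participants[r] <= coi for r in coi if r in participants)


def generalized_concepts(coi, isa):
    """G(CoI): the concepts of interest together with all their generalizations."""
    result = set(coi)
    for concept in coi:
        result |= ancestors(isa, concept)
    return result


# Simplified fragment inspired by Figure 2 (assumed data, not the real OpenCyc content).
isa = {("Department", "Organization"), ("Organization", "Agent"), ("HasWorkers", "WorksWith")}
participants = {"HasDepartments": {"Organization", "Department"}}
coi = {"Department", "Organization", "HasDepartments"}

print(is_complete(coi, participants))          # prints True
print(sorted(generalized_concepts(coi, isa)))
# prints ['Agent', 'Department', 'HasDepartments', 'Organization']
```

Agent enters G(CoI) only as a generalization of Organization; WorksWith does not, because HasWorkers is not in this sample CoI.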

Adaptation to the UML. The adaptation is straightforward. We assume that the
pre/postconditions are written in the OCL. For example, consider the system operation
changeSuper, whose purpose is to change the super of a given department. Its
formal specification may be:

    context System::changeSuper(sub: Department, super: Organization)
    pre: super <> sub.super
    post: sub.super = super

The CoI inferred from this operation are: Department, Organization and HasDepartments.

3.3 Constrained Concepts 

We call constrained concepts of an integrity constraint ic, CC(ic), the set of concepts
appearing in the formal expression of ic. By abuse of notation we write CC(O) to
denote the set of concepts constrained by all the integrity constraints defined in
ontology O. Formally,

$$CC(O) = \{\, c \mid c \text{ is a concept} \land c \in O \land \exists\, \mathit{ic}\; (\mathit{ic} \text{ is a constraint} \land \mathit{ic} \in O \land c \in CC(\mathit{ic})) \,\}$$

Adaptation to the UML. If ic is a cardinality constraint of an attribute or association,
then CC(ic) will be the attribute or association, and the entity and data types involved
in it.

If ic is a completeness constraint with a common supertype super and subtypes
sub_1, ..., sub_n, then CC(ic) = {super, sub_1, ..., sub_n}.

A disjointness constraint with a common supertype super and subtypes sub_1, ...,
sub_n corresponds to n(n-1)/2 pairwise disjointness constraints, each of which constrains
two subtypes, sub_i and sub_j, and super. Strictly speaking, these constraints do not
involve the supertype super, but in the UML they are attached to sets of generalizations
having the same supertype.
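
The n(n-1)/2 count can be checked with a small sketch. The function name and the placeholder type names below are assumptions introduced for illustration; each returned set plays the role of CC(ic) for one pairwise constraint.

```python
from itertools import combinations


def pairwise_disjointness(super_type, subtypes):
    """Expand one disjointness constraint into the constrained-concept sets of its pairs."""
    return [{super_type, a, b} for a, b in combinations(subtypes, 2)]


# Placeholder names: supertype S with four subtypes A, B, C and D.
pairs = pairwise_disjointness("S", ["A", "B", "C", "D"])
print(len(pairs))  # prints 6, i.e. 4 * 3 / 2
```

Keeping the pairs separate matters for pruning: if one subtype is pruned, only the pairs that mention it lose a constrained concept, and the remaining pairs can still be kept.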

If ic is a redefinition of an association or attribute then CC(ic) will be the redefined 
association or attribute, and the entity and data types involved in the association or 
attribute. 

The constrained concepts of a general constraint will be the entity types, data
types, attributes, associations and association classes appearing in the OCL expression
that defines it. For example, if uniqueName is the general constraint defined in
Section 3.1, then CC(uniqueName) = {Agent, name}.
3.4 The Pruning Problem

Given an extended ontology O_X and a complete set of concepts of direct interest CoI,
as explained above, the pruning problem consists in obtaining a pruned ontology O_P
such that:

(a) The elements in O_P are a subset of those in O_X. We do not want to add new
elements to O_X in the pruning activity. Additions can be done in the refinement or in
the refactoring activities.

(b) O_P includes the concepts of direct interest CoI. These concepts must be included
in O_P because they are referred to in the specification of the system operations.

(c) If c_1 and c_2 are two concepts in O_P and there is a direct or indirect generalization
relationship between them in O_X, then such relationship must also exist in O_P.
Formally:

$$\forall c_1, c_2\; (c_1 \in O_P \land c_2 \in O_P \land \mathit{IsA}^{+}(c_1, c_2) \in O_X \rightarrow \mathit{IsA}^{+}(c_1, c_2) \in O_P)$$

(d) O_P includes all constraints defined in O_X whose constrained concepts are in
G(CoI). The rationale is that the constraints in O_X which constrain the Information
Base of O_P must be part of it. The constraints in O_X that involve one or more
concepts not in G(CoI) cannot be enforced and, therefore, are not part of O_P.

(e) O_P is consistent, that is, it is a valid instance of the conceptual modeling
language in which it is specified (metamodel).

(f) O_P is minimal, in the sense that if any of its elements is removed from it, the
resulting ontology does not satisfy (b-e) above.

For each O_X and CoI there is at least one ontology O_P that satisfies the above
conditions and, in the general case, there may be more than one.

To the best of our knowledge, there does not exist a method that obtains O_P
automatically in a context similar to ours. In what follows we describe a method for the
problem. In the next section we will compare it with existing similar methods.

3.5 The Pruning Algorithm 

Our algorithm obtains O_P in three steps. The algorithm begins with an initial ontology
O_0 which is exactly O_X (that is, O_0 := O_X) and obtains O_P. The steps are:

— Pruning irrelevant concepts and constraints. The result is the ontology O_1.

— Pruning unnecessary parents. The result is the ontology O_2.

— Pruning unnecessary generalization paths. The result is O_P.

Pruning irrelevant concepts and constraints. The concepts of direct interest for the
IS are given in the set CoI, and G(CoI) is the set of concepts in which the IS is
directly or indirectly interested. However, O_0 may include other concepts, which are
irrelevant for the IS. Therefore, in the first step we prune from O_0 all concepts which
are not in G(CoI), that is, we prune the set of concepts:

$$\mathit{IrrelevantConcepts} = \{\, c \mid c \text{ is a concept} \land c \in O_0 \land c \notin G(\mathit{CoI}) \,\}$$
Pruning a concept implies the pruning of all generalization relationships in which
that concept participates.

Similarly, we prune the constraints in O_0 that are not relevant for the IS, because
they constrain one or more concepts not in G(CoI). That is, we prune the set of
constraints:

$$\mathit{IrrelevantConstraints} = \{\, \mathit{ic} \mid \mathit{ic} \text{ is a constraint} \land \mathit{ic} \in O_0 \land \exists c\; (c \in CC(\mathit{ic}) \land c \notin G(\mathit{CoI})) \,\}$$

The result of this step is the ontology O_1:

$$O_1 = O_0 - \mathit{IrrelevantConcepts} - \mathit{IrrelevantConstraints}$$
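
The first step can be sketched as plain set filtering. This is an illustration under assumptions, not the authors' implementation: the function name, the representation of constraints as a dictionary from a constraint name to CC(ic), and the sample data (including the made-up constraint `atomRule`) are introduced here.

```python
def prune_irrelevant(concepts, isa, constraints, g_coi):
    """First pruning step: keep only what lies entirely inside G(CoI).

    concepts    -- set of concept names in O_0
    isa         -- set of direct (sub, super) generalization pairs
    constraints -- dict mapping a constraint name to its constrained concepts CC(ic)
    g_coi       -- the set G(CoI)
    """
    kept_concepts = concepts & g_coi
    # Pruning a concept also prunes every generalization it participates in.
    kept_isa = {(sub, sup) for sub, sup in isa
                if sub in kept_concepts and sup in kept_concepts}
    # A constraint survives only if all of its constrained concepts are in G(CoI).
    kept_constraints = {name: cc for name, cc in constraints.items() if cc <= g_coi}
    return kept_concepts, kept_isa, kept_constraints


concepts = {"Agent", "Person", "Organization", "name",
            "HasWorkers", "WorksWith", "HasEmployees", "Atom"}
isa = {("Person", "Agent"), ("Organization", "Agent"),
       ("HasWorkers", "WorksWith"), ("HasEmployees", "HasWorkers")}
constraints = {"uniqueName": {"Agent", "name"},
               "atomRule": {"Atom"}}  # atomRule is a made-up constraint
g_coi = {"Agent", "Person", "Organization", "name", "HasWorkers", "WorksWith"}

o1_concepts, o1_isa, o1_constraints = prune_irrelevant(concepts, isa, constraints, g_coi)
print(sorted(concepts - o1_concepts))  # prints ['Atom', 'HasEmployees']
print(sorted(o1_constraints))          # prints ['uniqueName']
```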

In the example of Figure 2, we have that HasWorkers is a concept of interest and,
therefore, $\{\mathit{HasWorkers}, \mathit{WorksWith}\} \subseteq G(\mathit{CoI})$. However,
HasEmployees, a subtype of HasWorkers, is not a member of G(CoI) and it is then pruned
in this step. Likewise, Person is a concept of interest but its subtypes (Student,
HumanChild, HumanAdult, FemalePerson, MalePerson, etc., not shown in Figure 2) are not,
and therefore they are also pruned in this step. The same happens to “lateral” concepts
such as Atom or Electron.

In the case study, after the application of this step we have an ontology O_1
consisting of:

— 96 Entity types and 8 Data types. 

— 6 Attributes and 21 Associations. 

— 4 general integrity constraints. 

Pruning unnecessary parents. After the previous step, the concepts of the resulting
ontology (O_1) are exactly G(CoI). However, not all of them are needed in the CS. The
concepts strictly needed are given by:

$$\mathit{NeededConcepts} = \mathit{CoI} \cup CC(O_1)$$

The other concepts (i.e. those given by G(CoI) - NeededConcepts) are potentially not
needed. We can prune the parents of NeededConcepts which are not children of some
concept in NeededConcepts. Formally,

$$\mathit{UnnecessaryParents} = \{\, c \mid c \notin \mathit{NeededConcepts} \land \neg \exists c'\; (c' \in \mathit{NeededConcepts} \land \mathit{IsA}^{+}(c, c')) \,\}$$

As we have said before, the pruning of a concept implies the pruning of all
generalization relationships in which that concept participates.

The result of this step is the ontology O_2:

$$O_2 = O_1 - \mathit{UnnecessaryParents}$$
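
A minimal sketch of this second step follows. The helper names and the sample hierarchy are assumptions made for illustration (the hierarchy is much simpler than the one in the case study, and `Thing` is used here simply as a parent of Agent that no needed concept requires).

```python
def ancestors(isa, concept):
    """Return every c with IsA+(concept, c), following direct (sub, super) pairs."""
    found, stack = set(), [concept]
    while stack:
        current = stack.pop()
        for sub, sup in isa:
            if sub == current and sup not in found:
                found.add(sup)
                stack.append(sup)
    return found


def unnecessary_parents(concepts, isa, needed):
    """Concepts that are not needed and are not below any needed concept."""
    return {c for c in concepts
            if c not in needed and not (ancestors(isa, c) & needed)}


concepts = {"Thing", "Agent", "Animal", "Person", "Organization",
            "name", "HasWorkers", "WorksWith"}
isa = {("Agent", "Thing"), ("Animal", "Agent"), ("Person", "Animal"),
       ("Organization", "Agent"), ("HasWorkers", "WorksWith")}
# NeededConcepts = CoI united with CC(O_1); here Agent and name come from uniqueName.
needed = {"Person", "Organization", "HasWorkers", "Agent", "name"}

print(sorted(unnecessary_parents(concepts, isa, needed)))  # prints ['Thing', 'WorksWith']
```

Animal is not needed either, but it survives this step because it is a child of the needed concept Agent; such intermediate concepts are the subject of the third step.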

In Figure 2, an example of unnecessary parent is the association WorksWith. In the
case study, WorksWith is neither a needed concept of O_1 nor a child of some needed
concept, and therefore it is pruned in this step.

In the case study, after the application of this step we have an ontology O_2
consisting of:

— 23 Entity types and 6 Data types. 

— 6 Attributes and 5 Associations. 

— 4 general integrity constraints. 
Pruning unnecessary generalization paths. In some cases, the ontology O_2 may
contain several generalization paths between two concepts, not all of which are
necessary. The purpose of the third step is to prune these paths.

We say that there is a generalization path between C_1 and C_n if:

— C_1 and C_n are two concepts from O_2,

— IsA^+(C_1, C_n), and

— The path includes two or more generalization relationships IsA(C_1, C_2), ...,
IsA(C_{n-1}, C_n).

A generalization path IsA(C_1, C_2), ..., IsA(C_{n-1}, C_n) between C_1 and C_n is
potentially redundant if none of the intermediate concepts C_2, ..., C_{n-1}:

— Is a member of the set $\mathit{CoI} \cup CC(O_2)$.

— Is the super or the sub of other generalization relationships.

A potentially redundant generalization path between concepts C_1 and C_n is
redundant if there are other generalization paths between the same pair of concepts. In
this case, we prune the concepts C_2, ..., C_{n-1} and all generalization relationships in
which they participate. Note that, in the general case, this step is not deterministic.
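
The detection of redundant paths can be sketched as follows. This is a rough illustration under several assumptions, not the authors' algorithm: the function names are introduced here, "other generalization relationships" is interpreted as an intermediate concept taking part in more than the two relationships of its own path, and the hierarchy is assumed to be acyclic. The sample data mimics the situation described for Figure 3; the text does not name the fourth intermediate concept, so `Other` stands in for it.

```python
def all_paths(isa, start, end):
    """All chains of direct IsA pairs leading from start up to end."""
    parents = {}
    for sub, sup in isa:
        parents.setdefault(sub, set()).add(sup)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        for sup in parents.get(path[-1], ()):
            if sup == end:
                paths.append(path + [sup])
            elif sup not in path:
                stack.append(path + [sup])
    return paths


def degree(isa, concept):
    """Number of generalization relationships in which the concept participates."""
    return sum(1 for pair in isa if concept in pair)


def redundant_paths(isa, start, end, protected):
    """Paths whose intermediates are unprotected and unshared, when an alternative exists."""
    paths = all_paths(isa, start, end)
    if len(paths) < 2:
        return []
    return [p for p in paths
            if p[1:-1] and all(c not in protected and degree(isa, c) == 2
                               for c in p[1:-1])]


isa = {("Person", "Animal"), ("Animal", "PerceptualAgent"), ("PerceptualAgent", "Agent"),
       ("Person", "Other"), ("Other", "SocialBeing"), ("SocialBeing", "Agent"),
       ("Organization", "SocialBeing")}
protected = {"Person", "Agent", "Organization"}  # stands for CoI united with CC(O_2)

print(redundant_paths(isa, "Person", "Agent", protected))
# prints [['Person', 'Animal', 'PerceptualAgent', 'Agent']]
```

When several paths between the same pair qualify, pruning one of them may leave the other as the only remaining path, which is one way to see why the step is not deterministic: the outcome depends on the order in which paths are pruned.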

The output of this step is the pruned ontology, O p . 

Figure 3 shows two generalization paths between the concepts of interest Person
and Agent. None of the four intermediate concepts is a member of
$\mathit{CoI} \cup CC(O_2)$. However, SocialBeing is the super of a generalization of
Organization. Therefore, the only potentially redundant generalization path is
IsA(Person, Animal), IsA(Animal, PerceptualAgent), IsA(PerceptualAgent, Agent), and it
can be pruned from the ontology.

In the case study, after the application of this step we have an ontology O_P
consisting of (see Figure 4):

— 10 Entity types and 6 Data types (not shown in the Figure). 

— 6 Attributes and 5 Associations. 

— 4 general integrity constraints. 



4 Comparison with Previous Work 

The need for pruning an ontology has been described in several research works in the 
fields of information systems and knowledge bases development. We may mention 
Swartout et al. [23], Knowledge Bus [20], Text-To-Onto [9, 14], Wouters et al. [26], 
the ODS (Ontology-Domain-System) approach [25], and OntoLearn [15]. 

Even if the above works differ in the context in which the need for pruning arises,
the ontology language, the particular ontology used as base, or the selection of the
concepts of interest, we believe that (at least parts of) our pruning method can be
adapted to be used successfully in all those works. The reasons are: (1) we deal with
any base ontology; (2) our method can be adapted to any ontology language; (3) we
take into account the specificity of entity types, relationship types, generalizations and
constraints present in all complete conceptual modeling languages; and (4) we may
obtain the concepts of interest from the functional specifications. In the following we
Fig. 3. Two generalization paths between Person and Agent




Fig. 4. The pruned ontology in the case study 



give some comments on the first two works, which are the most closely related to
ours and which describe a comparable pruning method.

The purpose of Swartout et al. [23] is the development of specialized, domain spe- 
cific ontologies from a large base ontology. The base ontology is Sensus, a natural 
language based ontology containing well over 50,000 concepts. The elements of the 
ontology are only entity types and generalization relationships. The concepts of inter- 
est are assumed to be a set of entity types (called the "seed") selected explicitly by 
domain experts, and all entity types that generalize them. The pruning method corre- 
sponds roughly to our first step (pruning irrelevant concepts and constraints). Using 
our method, the domain experts could select the seed, as before, but also the general- 
ized entity types of interest. The two other steps of our method could then be applied 
here, thus obtaining more specific domain ontologies. 
The purpose of Knowledge Bus [20] is to generate information systems from
application-focused subsets of a large ontology. The base ontology is Cyc, and the
ontology language is CycL. The concepts of interest are assumed to be the set of entity
types defined in a context (a subset of Cyc), also called the "seed" set, and all the 
entity and relationship types that can be "navigated" directly or indirectly from them. 
For example, with reference to Figure 2, if the seed set were only {Organization} 
then all entity and relationship types shown in that figure would be considered con- 
cepts of interest. If we consider not only the fragment shown in that figure but the 
complete OpenCyc, then over 700 entity types and 1300 relationship types would be 
considered concepts of interest. The pruning method (called the sub-ontology extrac- 
tor) corresponds here also to our first step (pruning irrelevant concepts and con- 
straints). The result is that (as the authors recognize) many superfluous types are ex- 
tracted from Cyc. Using our method, the domain experts can be more precise with 
respect to the concepts of interest. They could select the seed, as before, but also the 
generalized entity and relationship types of interest. The two other steps of our 
method could then be applied here as well, thus improving the specificity of the sub- 
ontology extraction process. 



5 Conclusions 

We have tried to contribute to the approach of developing conceptual schemas of 
information systems by reusing existing ontologies. We, as many others, believe that 
this approach offers a great potential for increasing both the conceptual schema qual- 
ity and the development productivity. 

We have focused on the problem of pruning ontologies. The problem arises when 
the reused ontology is large and it includes many concepts which are superfluous for 
the final conceptual schema. The objective of the pruning activity is to remove these 
superfluous concepts. 

We have presented a new formal method for pruning an ontology. The input to our 
method is the ontology and the set of concepts of interest. When the functional re- 
quirements are formally specified, the concepts of interest can be automatically ex- 
tracted from them. Otherwise, the designer has to define them explicitly. From this 
input, our method obtains automatically a pruned ontology, in which most of the su- 
perfluous concepts have been removed. 

We have formalized the method independently of the conceptual modeling language
used, so that it can be adapted to most languages. We have shown
the details of its adaptation to the UML. In addition, our method can be used
with any ontology. We have illustrated the method by means of its application to a 
case study that refines the public version of the Cyc ontology. We have shown that 
our method improves on similar existing methods, due to its generality and greater 
pruning effectiveness. 

We plan to continue our work in (at least) three directions. First, our method as- 
sumes the pruning activity in the context of the development of a conceptual schema, 
but the method can be used in other contexts as well. In particular, we would like to 
use it in the development of domain ontologies. Second, we would like to adapt the 
method to other general languages such as the OWL (Web Ontology Language) [19]. 
Finally, we plan to work on the activity that follows pruning: refactoring. The large 
amount of existing work on schema transformation can be “reused” for that purpose. 
Abstract. Most current conceptual modeling languages and methods do not
model events as entities. We argue that, at least in Object-Oriented (O-O)
languages, modeling events as entities provides substantial benefits. We show that
a method for behavioral modeling that deals with event and entity types in a
uniform way may yield better behavioral schemas. The proposed method makes
extensive use of language constructs such as constraints, derived types,
derivation rules, type specializations and operations, which are present in all
complete O-O conceptual modeling languages. The method can be adapted to most
O-O languages. In this paper we explain its adaptation to the UML.



1 Introduction 

According to the well-known 100 Percent (or completeness) principle, a conceptual 
schema must include all relevant general static and dynamic aspects [18]. The part of 
a conceptual schema that deals with static aspects is called the structural schema (or 
subschema), and the part that deals with dynamic aspects is called the behavioral 
schema. This paper focuses on behavioral schemas. 

The approaches to behavioral modeling taken by conceptual modeling languages
are diverse. The main differences are due to the style on which each language is based
(logic, structured, object-oriented (O-O), temporal, Petri nets, etc.) and to the degree
of formalization (informal, semiformal, formal) they aim at. These approaches have
been surveyed and compared in (among many others) [12,34,2,25].

In O-O languages, an important classification of behavioral modeling approaches
is with respect to whether or not they model events as entities (objects or individuals).
When events are entities, they are modeled in a way similar to ordinary entities: they
are instances of event types (a special kind of entity types), they may participate in
relationships, they can be specialized or generalized, and so on. When events are not
considered entities, they are modeled by means of other language constructs, usually
as invocations of operations or actions.

Most current conceptual modeling languages and methods do not model events as
entities. Among them, we may mention the well-known Syntropy [8], Fusion [7],
Object-oriented SSADM [30], ROOM [32], B [1], TROLL [19], Statecharts [16],
IDEA [6], Catalysis [10], IDEF1X [17], Larman’s method [20] and Executable UML
[23].
In this paper we argue that, at least in O-O languages, modeling events as entities
provides substantial benefits. We show that a method for behavioral modeling that
deals with event and entity types in a uniform way may yield better behavioral
schemas.

The idea that events can be modeled as objects is not new in the conceptual
modeling field. It was suggested at the beginning of the 1980s [3,4,24], and it was (at
least partially) adopted in a few languages and methods developed around the early
1990s, such as OSA [13], KAOS [11], IFO2 [33], and Martin and Odell’s method [21].
However, later methods have advocated approaches to behavioral modeling that use (only)
state transition diagrams and/or operations.

Currently, there does not exist any UML-based method that models events as enti- 
ties. The UML language distinguishes four kinds of events: (1) call events, which are 
invocations of operations; (2) signal events, which are similar to objects, but are lim- 
ited and intended for asynchronous communication between objects; (3) change 
events, which are satisfaction of boolean conditions; and (4) time events, which are 
satisfaction of time expressions [29, 31]. Therefore the UML does not provide an 
appropriate construct for modeling events as objects at the conceptual level. Change 
and time events are useful constructs in conceptual modeling, but they can be used 
only in state transition diagrams. 

We propose a new method that makes extensive use of language constructs
such as constraints, derived types, derivation rules, type specializations and
operations, which are present in all complete modern conceptual modeling languages. The
method can be adapted to most O-O languages. We explain in this paper its
adaptation to the UML.

The structure of the paper is as follows. The next section defines the terminology we
use, and delimits the scope of our work. Section 3 presents the basics of the method
we propose. Section 4 describes a useful extension to the basic method. Finally,
Section 5 summarizes the conclusions and points to future work. The examples in the
paper will be about a fragment of an elementary version of a Material Requirements
Planning (MRP) system. Figure 1 shows the structural schema of that fragment. The
details are introduced where they arise.

2 Events 

In this section, we introduce the terminology, definitions and classifications of events 
used in the paper, and we delimit the scope of our method [21]. 

2.1 Domain Events 

An Information System (IS) maintains a representation of the state of a domain in its 
Information Base (IB). The state of a domain at some time point is the set of instances 
of the relevant entity and relationship types that exist in the domain at that time. The 
state of the domain at time t changes if the domain state at that time, t, is different 
from the domain state at the previous time, t-1. A state change consists in a set of one 
or more structural events. A structural event is an elementary change in the
population of an entity or relationship type [9]. The precise number and meaning of
structural events depend on the conceptual modeling language used. In logic-based
languages, there are only two kinds of structural events: insertion and removal of facts.
In the UML, there are nine kinds of structural events: create object, reclassify object,
etc. [29, p. 203+]. A domain event is a state change that consists of a set of structural
events that are perceived or considered as a single change in the domain. The time at
which the change occurs is the occurrence time of the event. In principle, two or
more domain events may occur at the same time.
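
For the logic-based case, where a state is simply a set of facts, the structural events between two consecutive states can be sketched as two set differences. The function name and the sample facts (including the type name `Receipt`) are assumptions introduced for illustration only.

```python
def structural_events(previous, current):
    """Return the facts inserted and the facts removed between two consecutive states."""
    return current - previous, previous - current


# Hypothetical facts, written as (type name, instance identifier).
state_before = {("ScheduledReceipt", "sr1"), ("ScheduledReceipt", "sr2")}
state_after = {("ScheduledReceipt", "sr2"), ("Receipt", "sr1")}

inserted, removed = structural_events(state_before, state_after)
print(inserted)  # prints {('Receipt', 'sr1')}
print(removed)   # prints {('ScheduledReceipt', 'sr1')}
```

In this reading, a domain event such as the reception of a scheduled receipt groups one insertion and one removal into a single perceived change.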




Fig. 1. Fragment of the structural schema for an MRP application 



Usually, domain events are caused by actions performed in the domain. However, 
in some cases the users may delegate the task of producing some domain events to the 
IS. We can then classify the domain events in terms of where (the source) they have 
been produced: either in the domain or in the IS. They are called external and gener- 
ated domain events, respectively. Some of the external events are produced by just 
the passing of time, and they are called temporal domain events. We briefly review 
these three kinds of domain events in the following. 

An external domain event is caused by an action performed in the domain. The 
event occurs independently from the IS. An example of external domain event is the 
reception of a scheduled receipt. 

A temporal domain event is caused by the passing of time. The domain is in some 
state and the simple passing of time changes it. The event occurs independently from 
the IS. An example of temporal domain event could be the arrival of the day after the 
due date of a scheduled receipt. The scheduled receipt becomes an overdue order by 
just the passing of time (the entity type OverdueOrder, subtype of ScheduledReceipt, 
is not shown in Figure 1). 

A generated domain event is caused by actions performed by the IS itself. Gener- 
ated domain events are caused when some generating condition C is satisfied. The IS 
detects when C is satisfied and, at that time, it generates the corresponding domain 
event. In principle, the generating condition might take any form. However, the most 
widely used particular forms are: 

— State-based. The change of the truth value of a boolean condition over the IB in 
two consecutive states. 

— Event-based. The occurrence of an event, when the IB satisfies a given condition. 

An example of a generated domain event with a state-based generating condition
could be the automatic generation of purchase order releases. The generating
condition could be: “The quantity on hand plus the total expected receipts of a
product is equal to or greater than the sum of the required quantities of that
product”. When the truth value of the condition changes between two consecutive
states (from true to false), the system must generate a “purchase order release”
for the corresponding product.
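The state-based generating condition of this example can be sketched in Python as follows. This is only an illustration under our own assumptions: the names ProductState, covered and generated_events are hypothetical, and the IB is reduced to three numbers per product.

```python
from dataclasses import dataclass


@dataclass
class ProductState:
    """Snapshot of the IB facts about one product (hypothetical names)."""
    quantity_on_hand: int
    expected_receipts: int
    required_quantity: int


def covered(state: ProductState) -> bool:
    """Generating condition C: stock plus expected receipts cover the requirements."""
    return state.quantity_on_hand + state.expected_receipts >= state.required_quantity


def generated_events(previous: ProductState, current: ProductState) -> list[str]:
    """Return the domain events generated by one state transition.

    A release is generated only when C changes from true to false between
    two consecutive states, not whenever C happens to be false.
    """
    if covered(previous) and not covered(current):
        return ["PurchaseOrderRelease"]
    return []


before = ProductState(quantity_on_hand=10, expected_receipts=5, required_quantity=12)
after = ProductState(quantity_on_hand=10, expected_receipts=5, required_quantity=20)
print(generated_events(before, after))  # C changes from true to false
print(generated_events(after, after))   # C stays false, so nothing is generated
```

Comparing two consecutive states, rather than testing the current state alone, is what distinguishes a state-based condition from a plain boolean check.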

2.2 Query Events 

A query event is a request for information to which an IS must respond. Query 
events are not changes in the domain state represented in the IB. The source of query 
events may be external to the IS or the IS itself. They are then called external or gen- 
erated query events, respectively. 

An external query event is issued by an actor. Most query events are external. A 
generated query event is an implicit request for information issued by the IS itself. 
Similar to generated domain events, an IS may generate a query event when a gener- 
ating condition is satisfied. In principle, the generating condition might take any 
form. However, two particular forms are widely used: 

— State-based. The change of the truth value of a boolean condition over the IB in 
two consecutive states. For example, sending an automatic reminder notice to a 
vendor when a scheduled receipt is approaching its due date. 

— Event-based. The occurrence of an event, when the IB satisfies a given condition. 

2.3 Scope of This Paper 

This paper deals with domain and query events that are external or generated with an 
event-based generating condition. The reason for leaving aside domain and query 
events that are generated with a state-based generating condition, and the temporal 
domain events, is that their definition in most O-O languages (including the UML)
requires the use of state transition diagrams, which are not discussed here. 

3 The Basics of the Method 

3.1 Events as Entities 

Our method adopts the view that events are similar to ordinary entities and, therefore, 
that events can be modeled as a special kind of entities, which we call event entities 
[3,4,24]. Event entities are instances of event types. An event type is a concept whose
instances, at a given time, are identifiable events that occur at that time. Like any 
other entity, event entities may participate in relationships. 

In non-temporal IBs, event entities exist in the IB only during their occurrence time.
It is assumed that events are instantaneous, that the IS response to them is also
instantaneous, and that after the response (and before the next time tick) the events
are removed from the IB. In this paper we do not deal with temporal IBs.

The adaptation of our method to a particular O-O language requires a linguistic
construct to define event types. In the UML, we use for this purpose a new
stereotype, which we call «event». A type with this stereotype defines an event type.

Like any other entity type, event types may be specialized and/or generalized. This 
will allow us to build a taxonomy of event types, where common elements are de- 
fined only once. It is convenient to define a root entity type, named Event, as shown 
in Figure 2. All event types are direct or indirect subtypes of Event. In fact, Event is 
defined as derived by union of its subtypes. We define in this event type the attribute 
time, which gives the occurrence time of each event. We define also the abstract op- 
eration effect, whose purpose will be made clear later. It is not necessary to stereotype 
event types as «event» because all direct or indirect subtypes of Event will be 
considered event types. 

The view of events as entities is not restricted to domain events. We apply it also 
to query events. 



3.2 Event Characteristics 

The characteristics of an event are the set of relationships in which it participates. 
There is at least one relationship between each event entity and a time point, repre- 
senting the event occurrence time. We assume that the characteristics of an event are 
determined when the event occurs, and remain fixed. 

In a particular language, the characteristics of events should be modeled like those 
of ordinary entities. In the UML, we model them as attributes or associations. Fig- 
ure 2 shows the definition of the external domain event type NewProduct, with four 
attributes (including time) and an association with Vendor. The immutability of char- 
acteristics can be defined by setting their isReadOnly property to true (not shown in 
the Figure) [29, p. 89+]. 
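As an informal illustration of events as entities with immutable characteristics, the following Python sketch mirrors the root type Event (attribute time, abstract operation effect) and the subtype NewProduct. It is not part of the UML adaptation: the dict-based IB and the snake_case field names are our assumptions, and only two of the NewProduct attributes mentioned in the text are reproduced.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime


@dataclass(frozen=True)
class Event(ABC):
    """Root event type: every event has an occurrence time and an effect."""
    time: datetime

    @abstractmethod
    def effect(self, ib: dict) -> None:
        """Apply the event effect to the information base (IB)."""


@dataclass(frozen=True)
class NewProduct(Event):
    """External domain event type (only two of its attributes are sketched)."""
    product_no: str
    vendor_name: str

    def effect(self, ib: dict) -> None:
        # Placeholder effect: register the product in a dict-based IB.
        ib.setdefault("products", {})[self.product_no] = self.vendor_name


event = NewProduct(time=datetime(2004, 11, 8), product_no="P-1", vendor_name="Acme")
try:
    event.vendor_name = "Other"  # characteristics are fixed once the event occurs
except FrozenInstanceError:
    print("characteristics are read-only")
```

Freezing the dataclass plays the role that isReadOnly plays in the UML model, and the abstract method prevents Event itself from being instantiated.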




Fig. 2. Definition of event type NewProduct

Event characteristics may be derived. The value for a derived characteristic may be
computed from other characteristics and/or the state of the IB when the event occurs,
as specified by the corresponding derivation rule. The practical importance of derived
characteristics is that they can be referred to in any expression (integrity constraints,
effect, etc.) exactly as the base ones, but their definition appears in a single place
(derivation rule).

In the UML, derived elements (attributes, associations, entity types) are marked 
with a slash (/). We define derivation rules by means of defining operations [26]. In 
the example of Figure 2, attribute vendorName gives the name of the vendor that will 
supply the new product. The association between NewProduct and Vendor may be 
derived from the vendor’s name. The defining operation NewProduct::vendor() :
Vendor gives the vendor associated with an event instance. In the UML 2.0, the
result of operations is specified by a body expression [29, p. 76+]. Using the OCL,
the formal specification of the above operation may be:

context NewProduct::vendor() : Vendor
body: Vendor.allInstances()->any(name = self.vendorName)
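For readers less familiar with the OCL, a rough Python counterpart of this derivation rule is sketched below. The population of Vendor is passed in explicitly as a list, which is an assumption of the sketch, and None stands in for the case where any() finds no matching instance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Vendor:
    name: str


@dataclass(frozen=True)
class NewProduct:
    product_no: str
    vendor_name: str

    def vendor(self, all_vendors: list[Vendor]) -> Optional[Vendor]:
        """Mimic Vendor.allInstances()->any(name = self.vendorName)."""
        return next((v for v in all_vendors if v.name == self.vendor_name), None)


vendors = [Vendor("Acme"), Vendor("Globex")]
print(NewProduct("P-1", "Globex").vendor(vendors))
print(NewProduct("P-2", "Nobody").vendor(vendors))
```

The association is never stored: it is recomputed from vendor_name each time, which is the point of defining it as derived.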

3.3 Event Constraints 

An event constraint is a condition an event must satisfy to occur [8]. An event con- 
straint involves the event characteristics and the state of the IB before the event oc- 
currence. It is assumed that the state of the IB before the event occurrence satisfies all 
defined constraints. Therefore, an event E can occur when the domain is in state S if: 
(1) state S satisfies all constraints, and (2) event E satisfies its event constraints. 

An IS checks event constraints when the events occur and the values of their char- 
acteristics have been established, but before the events have any effect in the IB or 
produce any answer. Events that do not satisfy their constraints are not allowed to 
occur and, therefore, they must be rejected. Event constraints checking is (assumed to 
be) done instantaneously. 

In a particular conceptual modeling language, event constraints can be represented 
like any other constraint. In the UML, they can be expressed as invariants or as con- 
straint operations [27]. Event constraints are always creation-time constraints because 
they must be satisfied when events occur. Here we will define constraints by opera- 
tions, called constraint operations, and we specify them in the OCL. In the UML, we 
show graphically constraint operations with the stereotype «IC». The result of the 
evaluation of constraint operations must be true. 

A constraint of the NewProduct event (Figure 2) is that the product being added 
cannot exist already. We define it with the constraint operation doesNotExist. The 
specification in the OCL is: 

context NewProduct::doesNotExist() : Boolean
body: not Product.allInstances()->exists(productNo = self.productNo)

On the other hand, the vendor must exist. This is also an event constraint.
However, in this case the constraint can be expressed as a cardinality constraint. The
multiplicity 1 in the vendor role requires that each instance of NewProduct must be
linked to exactly one vendor. The constraint is violated if the vendor() operation does
not return an instance of Vendor.
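The checking of event constraints before an event is allowed to occur can be illustrated with the following Python sketch. It assumes a dict-based IB and hypothetical names (does_not_exist, vendor_exists, accept); the multiplicity constraint is approximated by counting vendors with the given name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NewProduct:
    product_no: str
    vendor_name: str

    def does_not_exist(self, ib: dict) -> bool:
        """Constraint operation: the product must not be in the IB yet."""
        return self.product_no not in ib["products"]

    def vendor_exists(self, ib: dict) -> bool:
        """Multiplicity 1 on the vendor role: exactly one matching vendor."""
        return ib["vendors"].count(self.vendor_name) == 1

    def constraints(self):
        return (self.does_not_exist, self.vendor_exists)


def accept(event: NewProduct, ib: dict) -> bool:
    """Check all event constraints against the IB state before the event.

    The IB is not modified here; a rejected event has no effect at all.
    """
    return all(check(ib) for check in event.constraints())


ib = {"products": {"P-1"}, "vendors": ["Acme"]}
print(accept(NewProduct("P-2", "Acme"), ib))     # new product, known vendor
print(accept(NewProduct("P-1", "Acme"), ib))     # product already exists
print(accept(NewProduct("P-3", "Unknown"), ib))  # no such vendor
```

Every constraint operation must evaluate to true for the event to be accepted, which matches the requirement that the result of constraint operations must be true.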

An event constraint defined in a supertype applies to all its direct and indirect in- 
stances. This is one of the advantages of defining event taxonomies: common con- 
straints can be defined in a generalized event type. Figure 3 shows an example. The 
event type ExistingProductEvent is defined as the union of NewRequirement,
PurchaseOrderRelease and ProductDetails. The constraint that the product must exist is
defined in ExistingProductEvent, and it applies to all its indirect instances. Note that 
the constraint has been defined by a cardinality constraint, as explained above. Al- 
though it is not shown in Figure 3, the event type ExistingProductEvent is a subtype 
of Event. 

Figure 3 shows also the constraint validDate in NewRequirement. The constraint is
satisfied if dateRequired is greater than the event date. 




Fig. 3. ExistingProductEvent is a subtype of Event (not shown here) and a common supertype of
the domain event types NewRequirement and PurchaseOrderRelease and of the query event type
ProductDetails

3.4 Query Events Effects 

The effect of a query event is an answer providing the requested information. The 
effect is specified by an expression whose evaluation on the IB gives the requested 
information. The query is written in some language, depending on the conceptual 
modeling language used. In the UML, we can represent the answer to a query event 
and the query expression in several ways. We explain one of them here, which can be 
used as is, or as a basis for the development of alternative ways. 

The answer to a query event is modeled as one or more attributes and/or associa- 
tions of the event, with some predefined name. In the examples, we shall use names 
starting with answer. An alternative could be the use of a stereotype to indicate that 
an attribute or association is the answer of the event. 

Now, we need a way to define the value of the answer attributes and associations. 
To this end, we use the operation effect that we have defined in Event. This operation 
will have a different specification in each event type. For query events, its purpose is
to specify the values of the answer attributes and associations. The specification of
the operation can be done by means of postconditions, using the OCL. 

Figure 3 shows the representation of external query event type ProductDetails. 
The answer is given by attribute: 

answer : TupleType(qoh : natural, vendorName : String)

The specification of the effect operation may be:

context ProductDetails::effect()
post: answer = Tuple(qoh = product.quantityOnHand,
                     vendorName = product.vendor.name)

Alternatively, in O-O languages the answer to a query event could be specified as
the invocation of some operation. The effect of this operation would then be the
answer of the query event.
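A query event whose effect fills in an answer attribute might look as follows in Python. This is a sketch only: the classes Product and Vendor are reduced to the two properties used by the query, and the answer tuple is represented as a dict, which is our choice and not prescribed by the method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Vendor:
    name: str


@dataclass
class Product:
    quantity_on_hand: int
    vendor: Vendor


@dataclass
class ProductDetails:
    """External query event: asks for the stock and vendor of one product."""
    product: Product
    answer: Optional[dict] = None

    def effect(self) -> None:
        # The postcondition fixes the value of the answer; the IB is not changed.
        self.answer = {
            "qoh": self.product.quantity_on_hand,
            "vendorName": self.product.vendor.name,
        }


query = ProductDetails(product=Product(quantity_on_hand=40, vendor=Vendor("Acme")))
query.effect()
print(query.answer)
```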

3.5 Domain Events Effects: The Postcondition Approach 

The effect of a domain event is a set of structural events. There are two main ap- 
proaches to the definition of that set: the postconditions and the structural events 
approaches [25]. These approaches are called declarative and imperative specifica- 
tions, respectively, in [34]. In the former, the definition is a condition satisfied by the 
IB after the application of the event effect. In the latter, the definition is an expression 
whose evaluation gives the corresponding structural events. Both approaches can be 
used in our method, although we (as many others) tend to favor the use of postcondi- 
tions. We deal with the postcondition approach in this subsection, and the structural 
events approach in the next one. 




Fig. 4. Definition of OrderReception and OrderReschedule event types 



In the postcondition approach, the effect of an event Ev is defined by a condition C 
over the IB. The idea is that the event Ev leaves the IB in a state that satisfies C. It is 
also assumed that the state after the event occurrence satisfies all constraints defined 
over the IB. Therefore, the effect of event Ev is a state that satisfies condition C and 
all IB constraints.

In an O-O language, we can represent the effect of a domain event in several ways.
As we did for query events, we explain one way here, which can be used as is, or as a
basis for the development of alternative ways. We define a particular operation in
each domain event type, whose purpose is to specify the effect. To this end, we use
the operation effect that we have defined in Event. This operation will have a different
specification in each event type. Now, the postcondition of this operation will be
exactly the postcondition of the corresponding event. As we have been doing until
now, in the UML we also use the OCL to specify these postconditions formally.

As an example, consider the external domain event type NewRequirement, shown
in Figure 3. The effect of one instance of this event type is the addition of one
instance into entity type Requirement (see Figure 1). Therefore, in this case the
specification of the effect operation is:

context NewRequirement::effect()
post ThereIsANewInstanceOfRequirement:
    r.oclIsNew() and
    r.oclIsTypeOf(Requirement) and
    r.dateRequired = dateRequired and
    r.quantity = quantity and
    r.product = product

In our method, we do not define preconditions in the specification of effect opera- 
tions. The reason is that we implicitly assume that the events satisfy their constraints 
before the application of their effect. In the example, we assume implicitly that a 
NewRequirement event references an existing product, and that its required date is 
valid. The postcondition states simply that a new instance of Requirement has been 
created in the IB, with the corresponding values for its attributes and association. Any 
implementation of the effect operation that leaves the IB in a state that satisfies the 
postcondition and the IB constraints is valid. 

Another example is the external domain event type OrderReception (see Figure 4). 
An instance of OrderReception occurs when a scheduled receipt is received. The 
event effect is that the purchase order now becomes ReceivedOrder (see Figure 1), 
and that the quantity on hand of the corresponding product is increased by the
quantity received. We specify this effect with two postconditions of effect() in
OrderReception:

context OrderReception::effect()
post TheOrderIsNowReceived:
    scheduledReceipt.oclIsTypeOf(ReceivedOrder) and
    scheduledReceipt.oclAsType(ReceivedOrder).receptionDate = CurrentDate

post TheQuantityOnHandIsIncreased:
    scheduledReceipt.product.quantityOnHand =
        scheduledReceipt.product.quantityOnHand@pre + scheduledReceipt.quantity
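To make the role of @pre concrete, the following Python sketch evaluates the two postconditions against a snapshot of the state taken before the event. All names are hypothetical, the reclassification into ReceivedOrder is approximated by a boolean flag, and effect is just one of many implementations that would satisfy the postconditions.

```python
import copy
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Product:
    quantity_on_hand: int


@dataclass
class ScheduledReceipt:
    product: Product
    quantity: int
    received: bool = False  # stands in for membership in ReceivedOrder
    reception_date: Optional[date] = None


def effect(receipt: ScheduledReceipt, today: date) -> None:
    """One possible implementation of OrderReception::effect()."""
    receipt.received = True
    receipt.reception_date = today
    receipt.product.quantity_on_hand += receipt.quantity


def postconditions_hold(pre: ScheduledReceipt, post: ScheduledReceipt, today: date) -> bool:
    """Evaluate both postconditions; `pre` plays the role of the OCL @pre."""
    order_is_now_received = post.received and post.reception_date == today
    quantity_is_increased = (
        post.product.quantity_on_hand == pre.product.quantity_on_hand + post.quantity
    )
    return order_is_now_received and quantity_is_increased


receipt = ScheduledReceipt(product=Product(quantity_on_hand=40), quantity=25)
snapshot = copy.deepcopy(receipt)  # state of the IB before the event
effect(receipt, date(2004, 11, 8))
print(postconditions_hold(snapshot, receipt, date(2004, 11, 8)))
```

The checker never looks at how effect is implemented, only at the states before and after, which is the sense in which the postcondition approach is declarative.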



3.6 Domain Events Effects: The Structural Events Approach 

In the structural events approach, the effect of an event Ev is defined by an expression 
written in some language. The idea is that the evaluation of the expression gives the
set S of structural events corresponding to the event Ev effect. The application of S to
the previous state of the IB produces the new state. The new state of the IB is the 
previous state plus the entities or relationships inserted, and minus the entities or 
relationships deleted. This approach is in contrast with the previous one, which de- 
fines a condition that characterizes the state of the IB after the event. It is assumed 
that the set S is such that it leaves the IB in a new state that satisfies all the
constraints. Therefore, when defining the expression, one must take into account the
existing constraints and ensure that the new state of the IB will satisfy all of them.

Our method could be used in O-O languages that follow the structural events
approach. The idea is to provide a method for the effect operations. The method is a
procedural expression, written in the corresponding language, whose evaluation 
yields the structural events. 
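By way of contrast with the postcondition approach, a structural-events effect can be sketched in Python as a function that returns the set S of insertions and deletions, together with a generic function that applies S to the previous state. The encoding of the IB as a set of fact tuples, and the names Insert, Delete and apply_events, are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    fact: tuple


@dataclass(frozen=True)
class Delete:
    fact: tuple


def order_reception_effect(order_id: str, product: str, qty: int, qoh_before: int) -> set:
    """Return the set S of structural events for a received order."""
    return {
        Insert(("ReceivedOrder", order_id)),
        Delete(("QuantityOnHand", product, qoh_before)),
        Insert(("QuantityOnHand", product, qoh_before + qty)),
    }


def apply_events(ib: frozenset, events: set) -> frozenset:
    """New state = previous state, plus inserted facts, minus deleted facts."""
    inserted = {e.fact for e in events if isinstance(e, Insert)}
    deleted = {e.fact for e in events if isinstance(e, Delete)}
    return frozenset((ib - deleted) | inserted)


ib = frozenset({("ScheduledReceipt", "O-7"), ("QuantityOnHand", "P-1", 40)})
new_ib = apply_events(ib, order_reception_effect("O-7", "P-1", 25, 40))
print(sorted(new_ib))
```

Here the author of the effect is responsible for producing a set S that leaves the IB consistent; nothing in apply_events checks the constraints.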

3.7 Comparison with Previous Work 

In most current conceptual modeling methods and languages, events are not consid- 
ered objects. Instead of this, events are represented as invocations of actions or opera- 
tions, or the reception of signals or messages. Event types are defined by means of 
operations (with their signatures) or an equivalent construct. 

We believe that the view of events as entities (albeit of a special kind) provides 
substantial benefits to behavioral modeling. The reason is that the uniform treatment 
of event and entity types implies that most (if not all) language constructs available 
for entity types can be used also for event types. In particular: (1) Event types with 
common characteristics, constraints, derivation rules and effects can be generalized, 
so that common parts are defined in a single place, instead of repeating them in each 
event type. We have found that, in practice, many event types have characteristics, 
constraints and derivation rules in common with others [14]; (2) The graphical nota- 
tion related to entity types (including attributes, associations, multiplicities, generali- 
zation, packages, etc.) can be used also for event types; and (3) Event types can be 
specialized in a way similar to entity types, as we explain in the next section. 

4 Event Specialization 

One of the fundamental constructs of O-O conceptual modeling languages is the
specialization of entity types. When we consider events as entities, we have the pos- 
sibility of defining specializations of event types. We may use these specializations 
when we want to define an event type whose characteristics, constraints and/or effect 
are extensions and/or specializations of another event type. 

For example, assume that some instances of NewRequirement are special because
they require a large quantity of their product and, for some reason, the quantity
required must be ordered immediately from the corresponding vendor. This special
behavior can be defined in a new event type, SpecialRequirement, defined as a
specialization of NewRequirement, as shown in Figure 5.

Note that SpecialRequirement redefines the constraint validDate, and adds a new
constraint called largeQuantity. The required date of the new events must be beyond
the current date plus the vendor’s lead time, and the quantity required must be at least
ten times the product order minimum.

Fig. 5. Two specializations of the event type NewRequirement (Fig. 3)

In the UML, the body of operations may be overridden when an operation is rede- 
fined, whereas preconditions and postconditions can only be added [29, p. 78]. There- 
fore, we redefine validDate as: 

context SpecialRequirement::validDate() : Boolean
body: dateRequired > time.date + product.vendor.leadTime

The new constraint largeQuantity can be defined as:

context SpecialRequirement::largeQuantity() : Boolean
body: quantity > product.orderMinimum * 10

The effect of a SpecialRequirement is the same as that of a NewRequirement, but 
we want the system to generate an instance of PurchaseOrderRelease (see Figure 3). 
We define this extension as an additional postcondition of the effect operation: 

context SpecialRequirement::effect()
post CreateAnInstanceOfEventTypePurchaseOrderRelease:
    pOR.oclIsNew() and
    pOR.oclIsTypeOf(PurchaseOrderRelease) and
    pOR.productNo = self.productNo and
    pOR.quantity = self.quantity and
    pOR.dueDate = self.dateRequired
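The same specialization can be mimicked with ordinary subclassing, as in the Python sketch below. It rests on several simplifying assumptions: the vendor lead time is stored directly on the product as a number of days, the IB is a plain list of tuples, and the method names are ours. The comparison in large_quantity follows the OCL body given above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Product:
    order_minimum: int
    lead_time_days: int  # vendor lead time, flattened onto the product here


@dataclass(frozen=True)
class NewRequirement:
    event_date: date
    date_required: date
    quantity: int
    product: Product

    def valid_date(self) -> bool:
        return self.date_required > self.event_date

    def constraints_hold(self) -> bool:
        return self.valid_date()

    def effect(self, ib: list) -> None:
        ib.append(("Requirement", self.date_required, self.quantity))


@dataclass(frozen=True)
class SpecialRequirement(NewRequirement):
    def valid_date(self) -> bool:  # redefined constraint body
        lead = timedelta(days=self.product.lead_time_days)
        return self.date_required > self.event_date + lead

    def large_quantity(self) -> bool:  # additional constraint
        return self.quantity > self.product.order_minimum * 10

    def constraints_hold(self) -> bool:
        return self.valid_date() and self.large_quantity()

    def effect(self, ib: list) -> None:
        super().effect(ib)  # the inherited postcondition still holds
        ib.append(("PurchaseOrderRelease", self.date_required, self.quantity))


product = Product(order_minimum=5, lead_time_days=10)
event = SpecialRequirement(date(2004, 11, 8), date(2004, 11, 30), 100, product)
ib = []
if event.constraints_hold():
    event.effect(ib)
print(ib)
```

Calling super().effect() before adding the release corresponds to the rule that postconditions can only be added in a redefinition, whereas valid_date is replaced outright, as a body may be.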

On the other hand, we can define event types derived by specialization. A derived 
event type is an event type whose instances at any time can be inferred by means of a 
derivation rule. An event type Ev is derived by specialization of event types Ev1, ...,
Evn when Ev is derived and its instances are also instances of Ev1, ..., Evn [28]. We
may use event types defined by specialization when we want to define particular 
constraints and/or effect for events that satisfy some condition. 

For example, suppose that some instances of NewRequirement are urgent because 
they are required within the temporal horizon of the current MRP plan (seven days), 
and therefore they could not have been taken into account when the plan was gener- 
ated. We want a behavior similar to the previous example. The difference is that now 
we determine automatically which are the urgent requirements. We define a new
event type, UrgentRequirement, shown in Figure 5, defined as derived by specialization
of NewRequirement.

In the UML, the name we give to the defining operations for derived entity types is
allInstances [26]. In this case, allInstances is a class operation that gives the
population of the type. The derivation rule of UrgentRequirement is then:

context UrgentRequirement::allInstances() : Set(UrgentRequirement)
body: NewRequirement.allInstances()->select(dateRequired < time.date + 7)

The effect of an urgent requirement is the same as that of a new requirement, but 
again we want the system to generate an instance of PurchaseOrderRelease (see 
Figure 3). We would define this extension as an additional postcondition of the effect 
operation, as we did in the previous example. 
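The derivation of UrgentRequirement by selection can be pictured with a short Python sketch. The function urgent_requirements and the field names are hypothetical, and the seven-day horizon is passed as a parameter only for convenience.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class NewRequirement:
    event_date: date
    date_required: date


def urgent_requirements(all_new_requirements, horizon_days=7):
    """Mimic NewRequirement.allInstances()->select(dateRequired < time.date + 7)."""
    horizon = timedelta(days=horizon_days)
    return [r for r in all_new_requirements
            if r.date_required < r.event_date + horizon]


today = date(2004, 11, 8)
events = [
    NewRequirement(today, date(2004, 11, 10)),  # inside the planning horizon
    NewRequirement(today, date(2004, 11, 30)),  # well beyond the horizon
]
print(urgent_requirements(events))
```

No event is ever declared urgent by its source; membership in the derived type follows entirely from the derivation rule.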

Comparison with Previous Work. Event specialization is not possible when events 
are seen as operation invocations. The consequence is that this powerful modeling 
construct cannot be used in methods like those mentioned in the Introduction. 

5 Conclusions 

In the context of O-O conceptual modeling languages, we have proposed a method
that models events as entities (objects), and event types as a special kind of entity
types. The method makes extensive use of language constructs such as constraints,
derived types, derivation rules, type specializations, operations and operation
redefinition, which are present in all complete conceptual modeling languages. The method
can be adapted to most O-O languages. In this paper we have explained in detail its
adaptation to the UML. The method is fully compatible with the UML-based CASE 
tools, and thus it can be adopted in industrial projects, if it is felt appropriate. 

The main advantage of the method we propose is the uniform treatment we give to 
event and entity types. The consequence is that most (if not all) language constructs 
available for entity types can be used also for event types. Event types may have 
constraints and derived characteristics, like entity types. Characteristics, constraints 
and effects shared by several event types may be defined in a single place. Event 
specialization allows the incremental definition of new event types, as refinements of 
their supertypes. Historical events ease the definition of constraints, derivation rules 
and event effects. In summary, we believe that the view of events as entities provides 
substantial benefits to behavioral modeling. 

Among the work that remains to be done, there is the integration of the proposed 
method with the state transition diagrams. These diagrams allow defining the kinds of 
events that were beyond the scope of this paper.

Abstract. An open challenge is to integrate XML and conceptual modeling
in order to satisfy large-scale enterprise needs. Because enterprises
typically have many data sources using different assumptions, formats, 
and schemas, all expressed in - or soon to be expressed in - XML, it is 
easy to become lost in an avalanche of XML detail. This creates an op- 
portunity for the conceptual modeling community to provide improved 
abstractions to help manage this detail. We present a vision for Concep- 
tual XML (C-XML) that builds on the established work of the concep- 
tual modeling community over the last several decades to bring improved 
modeling capabilities to XML-based development. Building on a frame- 
work such as C-XML will enable better management of enterprise-scale 
data and more rapid development of enterprise applications. 

1 Introduction 

A challenge [3] for modern enterprise modeling is to produce a simple conceptual 
model that: (1) works well with XML and XML Schema; (2) abstracts well for 
conceptual entities and relationships; (3) scales to handle both large data sets 
and complex object interrelationships; (4) allows for queries and defined views 
via XQuery; and (5) accommodates heterogeneity. 

The conceptual model must work well with XML and XML Schema because
XML is rapidly becoming the de facto standard for business data. Because
conceptualizations must support both high-level understanding and high-level 
program construction, the conceptual model must abstract well. Because many 
of today’s huge industrial conglomerations have large, enterprise-size data sets 
and increasingly complex constraints over their data, the conceptual model must 
scale up. Because XQuery, like XML, is rapidly becoming the industry standard, 
the conceptual model must smoothly incorporate both XQuery and XML. Fi- 
nally, because we can no longer assume that all enterprise data is integrated, 
the conceptual model must accommodate heterogeneity. Accommodating het- 
erogeneity also supports today’s rapid acquisitions and mergers, which require 
fast-paced solutions to data integration. 

We call the answer we offer for this challenge Conceptual XML (C-XML). 
C-XML is first and foremost a conceptual model, being fundamentally based on 
object-set and relationship-set constructs. As a central feature, C-XML supports
high-level object- and relationship-set construction at ever higher levels of
abstraction. At any level of abstraction the object and relationship sets are always
first class, which lets us address object and relationship sets uniformly, inde- 
pendent of level of abstraction. These features of C-XML make it abstract well 
and scale well. Secondly, C-XML is “model-equivalent” [9] with XML Schema, 
which means that C-XML can represent each component and constraint in XML 
Schema and vice versa. Because of this correspondence between C-XML and 
XML Schema, XQuery immediately applies to populated C-XML model in- 
stances and thus we can raise the level of abstraction for XQuery by applying it 
to high-level model instances rather than low-level XML documents. Further, we 
can define high-level XQuery-based mappings between C-XML model instances 
over in-house, autonomous databases, and we can declare virtual views over 
these mappings. Thus, we can accommodate heterogeneity at a higher level of 
abstraction and provide uniform access to all enterprise data. 

Besides enunciating a comprehensive vision for the XML/conceptual-modeling
challenge [3], our contributions in this paper include: (1) mappings to and from 
C-XML and XML Schema, (2) defined mechanisms for producing and using first- 
class, high-level, conceptual abstractions, and (3) XQuery view definitions over 
both standard and federated conceptual-model instances that are themselves 
conceptual-model equivalent. As a result of these contributions, C-XML and 
XML Schema can be fully interchangeable in their usage over both standard and
heterogeneous XML data repositories. This lets us leverage conceptual model 
abstractions for high-level understanding while retaining all the complex details 
involved with low-level XML Schema intricacies, view mappings, and integration 
issues over heterogeneous XML repositories. 

We present the details of our contributions as follows. Section 2 describes 
C-XML. Section 3 shows that C-XML is “model-equivalent” with XML Schema 
by providing mappings between the two. Section 4 describes C-XML views. We 
report the status of our implementation and conclude in Section 5. 



2 C-XML: Conceptual XML 

C-XML is a conceptual model consisting of object sets, relationship sets, and 
constraints over these object and relationship sets. Graphically a C-XML model 
instance M is an augmented hypergraph whose vertices and edges are respectively
the object sets and relationship sets of M, and whose augmentations
consist of decorations that represent constraints. Figure 1 shows an example. 

Fig. 1. Customer/Order C-XML Model Instance.

In the notation, boxes represent object sets: dashed if lexical, and not dashed
if nonlexical because their objects are represented by object identifiers. With
each object set we can associate a data frame (as we call it) to provide a rich
description of its value set and other properties. A data frame lets us specify,
for example, that OrderDate is of type Date or that ItemNr values must satisfy
the value pattern “[A-Z]{3}-\d{7}”. Lines connecting object sets are relationship
sets; these lines may be hyper-lines (hyper-edges in hyper-graphs) with
diamonds when they have more than two connections to object sets. Optional
or mandatory participation constraints respectively specify whether objects in
a connected relationship may or must participate in a relationship set (an “o”
on a connecting relationship-set line designates optional while the absence of
an “o” designates mandatory). Thus, for example, the C-XML model instance
in Figure 1 declares that an Order must include at least one Item but that an
Item need not be included in any Order. Arrowheads on lines specify functional
constraints. Thus, Figure 1 declares that an Item has a Price and a Description
and is in a one-to-one correspondence with ItemNr and that an Item in
an Order has one Qty and one SalePrice. In cases when optional and mandatory
participation constraints along with functional constraints are insufficient
to specify minimum and maximum participation, explicit min..max constraints
may be specified. Triangles denote generalization/specialization hierarchies. We
can constrain ISA hierarchies by partition (⊎), union (∪), or mutual exclusion
(+) among specializations. Any object-set/relationship-set connection may have
a role, but a role is simply a shorthand for an object set that denotes the subset
consisting of the objects that actually participate in the connection.
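As a small illustration of the kind of value restriction a data frame can carry, the ItemNr pattern quoted above can be checked in Python as follows. The function name is hypothetical, and full-string matching is used on the assumption that the pattern is meant to describe the whole value, as patterns in XML Schema do.

```python
import re

# Value pattern for ItemNr as quoted in the text: three capitals, a dash, seven digits.
ITEM_NR_PATTERN = re.compile(r"[A-Z]{3}-\d{7}")


def is_valid_item_nr(value: str) -> bool:
    """Return True when the whole value conforms to the ItemNr pattern."""
    return ITEM_NR_PATTERN.fullmatch(value) is not None


for candidate in ("ABC-1234567", "AB-1234567", "ABC-1234567x"):
    print(candidate, is_valid_item_nr(candidate))
```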

3 Translations Between C-XML and XML Schema 

Many translations between C-XML and XML Schema are possible. In recent ER 
conferences, researchers have described varying conceptual-model translations 
to and/or from XML or XML DTD’s or XML-Schema-like specifications. (See,
for example, [4,6,10].) It is not our purpose here to argue for or against a
particular translation. Indeed, we would argue that a variety of translations may 
be desirable. For any translation, however, we require information and constraint 
preservation. This ensures that an XML Schema and a conceptual instantiation 
of an XML Schema as a C-XML model instance correspond and that a system 
can reflect manipulations of the one in the other. 

To make our correspondence exact, we need information- and constraint- 
preserving translations in both directions. We do not, however, require that 
translations be inverses of one another; translations that generate members of
an equivalence class of XML Schema specifications and C-XML model instances
are sufficient. In Section 3.1 we present our C-XML-to-XML-Schema translation,
and in Section 3.2 we present an XML-Schema-to-C-XML translation. In
Section 3.3 we formalize the notions of information and constraint preservation 
and show that the translations we propose preserve information and constraints. 



3.1 Translation from C-XML to XML Schema 

We now describe our process for translating a C-XML model instance C to an
XML Schema S_C. We illustrate our translation process with the C-XML model
instance of Figure 1 translated to the corresponding XML Schema excerpted in
Figure 2.

Fully automatic translation from C to S_C is not only possible, but can be
done with certain guarantees regarding the quality of S_C. Our approach is based
on our previous work [8], which for C generates a forest of scheme trees F_C such
that (1) F_C has a minimal number of scheme trees, and (2) XML documents
conforming to F_C have no redundant data with respect to functional and
multivalued constraints of C. For our example in Figure 1, the algorithms in [8] will
generate the following two nested scheme trees.

(Customer, CustomerName, CustomerAddr, Discount, 
    (Order, OrderID, OrderDate, 
        (Item, SalePrice, Qty)*)*)* 

(Item, ItemNr, Description, Price, 
    (PreviousItem)*, (Manufacturer, RequestDateTime, Qty)*)* 

Observe that the XML Schema in Figure 2 satisfies these nesting specifications. 
Item in the second scheme tree appears as an element on Line 8 with ItemNr, 
Description, and Price defined as its attributes on Lines 28-30. PreviousItem 
is nested, by itself, underneath Item, on Line 18, and Manufacturer, 
RequestDateTime, and Qty are nested underneath Item as a group on Lines 13-15. 
The XML-Schema notation that accompanies these C-XML object-set names 
obscures the nesting to some extent, but this additional notation is necessary 
either to satisfy the syntactic requirements of XML Schema or to allow us to 
specify the constraints of the C-XML model instance. 
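
As an illustration only, the nested scheme trees listed above can be modeled as a small recursive structure. The sketch below is not the algorithm of [8]; the class name `SchemeTree` and its `render` method are hypothetical names introduced here, and the code only reproduces the parenthesized notation used in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchemeTree:
    """A scheme-tree node: flat object-set names plus repeated subtrees."""
    names: List[str]
    children: List["SchemeTree"] = field(default_factory=list)

    def render(self) -> str:
        # Each node is written as (names, child, ...)*, where * marks repetition.
        parts = self.names + [child.render() for child in self.children]
        return "(" + ", ".join(parts) + ")*"


# The two scheme trees given in the text for the model instance of Figure 1.
customer_tree = SchemeTree(
    ["Customer", "CustomerName", "CustomerAddr", "Discount"],
    [SchemeTree(["Order", "OrderID", "OrderDate"],
                [SchemeTree(["Item", "SalePrice", "Qty"])])],
)
item_tree = SchemeTree(
    ["Item", "ItemNr", "Description", "Price"],
    [SchemeTree(["PreviousItem"]),
     SchemeTree(["Manufacturer", "RequestDateTime", "Qty"])],
)

print(customer_tree.render())
print(item_tree.render())
```

Each child of a node corresponds to one nested, repeatable group in the generated XML Schema, which is why PreviousItem and the Manufacturer group appear as separate children of Item.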

As we continue, recall first that each C-XML object set has an associated 
data frame that contains specifications such as type declarations, value 
restrictions, and any other annotations needed to specify information about objects in 
the object set. For our work here, we let the kind of information that appears 
in a data frame correspond exactly to the kind of data constraint information 
specifiable in XML Schema. One example we point out explicitly is order 
information, which is usually absent in conceptual models, but is unavoidably present 
in XML. Thus, if we wish to say that CustomerName precedes CustomerAddr, 
we add the annotation “≺ CustomerAddr” to the CustomerName data frame 
and add the annotation “≻ CustomerName” to the CustomerAddr data frame. 
In our discussion, we assume that these annotations are in the data frames that 
accompany the object sets CustomerName and CustomerAddr in Figure 1. 

 4  <xs:element name="Document"> 
 5    <xs:complexType> 
 6      <xs:choice minOccurs="0" maxOccurs="unbounded"> 
 7        <xs:element ref="Customer"/> 
 8        <xs:element name="Item"> 
 9          <xs:complexType> 
10            <xs:sequence> 
11              <xs:element name="ItemMR" minOccurs="0" maxOccurs="5"> 
12                <xs:complexType> 
13                  <xs:attribute name="Manufacturer" type="xs:string" use="required"/> 
14                  <xs:attribute name="RequestDateTime" type="xs:date" use="required"/> 
15                  <xs:attribute name="Qty" type="xs:positiveInteger" use="required"/> ... 
18              <xs:element name="PreviousItem" minOccurs="0" maxOccurs="unbounded"> 
19                <xs:complexType> 
20                  <xs:attribute name="ItemNr" type="xs:positiveInteger" use="required"/> ... 
22                <xs:keyref name="r1" refer="ItemKey"> 
23                  <xs:selector xpath="."/> 
24                  <xs:field xpath="@ItemNr"/> ... 
28            <xs:attribute name="ItemNr" type="xs:positiveInteger" use="required"/> 
29            <xs:attribute name="Description" type="xs:string" use="required"/> 
30            <xs:attribute name="Price" type="xs:decimal" use="required"/> ... 
35    <xs:key name="OrderKey"><xs:selector xpath=".//Order"/><xs:field xpath="@OrderID"/> ... 
39    <xs:key name="ItemKey"><xs:selector xpath=".//Item"/><xs:field xpath="@ItemNr"/> ... 
44  <xs:element name="Customer" abstract="true"/> 
45  <xs:element name="PreferredCustomer" substitutionGroup="Customer"> 
46    <xs:complexType> 
47      <xs:group ref="CustomerDetails"/> 
48      <xs:attribute name="Discount" type="xs:string" use="required"/> ... 
51  <xs:element name="RegularCustomer" substitutionGroup="Customer"> 
52    <xs:complexType><xs:group ref="CustomerDetails"/></xs:complexType> ... 
56  <xs:group name="CustomerDetails"> 
57    <xs:sequence> 
58      <xs:element name="CustomerName" type="xs:string"/> 
59      <xs:element name="CustomerAddr" type="xs:string"/> 
60      <xs:element name="Order" minOccurs="0" maxOccurs="unbounded"> 
61        <xs:complexType> 
62          <xs:sequence> 
63            <xs:element name="OrderItem" minOccurs="0" maxOccurs="unbounded"> 
64              <xs:complexType> 
65                <xs:attribute name="Qty" type="xs:positiveInteger" use="required"/> ... 
69              <xs:keyref name="r3" refer="ItemKey"> 
70                <xs:selector xpath="."/> 
71                <xs:field xpath="@ItemNr"/> ... 
75          <xs:attribute name="OrderID" type="xs:positiveInteger" use="required"/> 
76          <xs:attribute name="OrderDate" type="xs:date" use="required"/> ... 

Fig. 2. XML Schema Excerpt for the C-XML Model Instance in Figure 1. 

Our conversion algorithm preserves all annotations found in C-XML data 
frames. This is where we obtain all the type specifications in Figure 2. We 
capture the order specification, CustomerName ≺ CustomerAddr, by making 
CustomerName and CustomerAddr elements (rather than attributes) and placing 
them, in order, in their proper place in the nesting: for our example, in Lines 
58 and 59, nested under CustomerDetails. 

In the conversion from C-XML to XML Schema we use attributes instead 
of elements where possible. An object set can be represented as an attribute of 
an element if it is lexical, is functionally dependent on the element, and has no 
order annotations. The object sets OrderID and OrderDate, for example, satisfy 
these conditions and appear as attributes of an Order element on Lines 75 and 
76. Both attributes are also marked as “required” because of their mandatory 
connection to Order as specified by the absence of an “o” on their connection 
to Order in Figure 1. 

When an object set is lexical but not functional and order constraints do 
not hold, the object set becomes an element with minimum and maximum 
participation constraints. PreviousItem in Line 18, for example, has a minimum 
participation constraint of 0 and a maximum of unbounded. 
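
The attribute-versus-element rules just described can be summarized as a small decision function. This is a minimal sketch under stated assumptions: the `ObjectSet` record, its field names, and the textual output format are hypothetical and chosen only for illustration; they are not part of C-XML or of our conversion algorithm.

```python
from dataclasses import dataclass


@dataclass
class ObjectSet:
    name: str
    lexical: bool                  # holds printable values, not object identifiers
    functional: bool               # functionally dependent on the parent element
    mandatory: bool = True         # no "o" (optional) marker on the connection
    ordered: bool = False          # data frame carries an order annotation
    min_occurs: int = 0            # participation bounds for repeated elements
    max_occurs: str = "unbounded"


def represent(obj: ObjectSet) -> str:
    """Describe how an object set would be rendered under its parent element."""
    if obj.lexical and obj.functional and not obj.ordered:
        use = "required" if obj.mandatory else "optional"
        return f'attribute {obj.name} use="{use}"'
    if obj.lexical and not obj.functional and not obj.ordered:
        return (f'element {obj.name} minOccurs="{obj.min_occurs}" '
                f'maxOccurs="{obj.max_occurs}"')
    # Ordered or nonlexical object sets fall through to a plain element.
    return f"element {obj.name}"


print(represent(ObjectSet("OrderID", lexical=True, functional=True)))
print(represent(ObjectSet("PreviousItem", lexical=True, functional=False)))
print(represent(ObjectSet("CustomerName", lexical=True, functional=True, ordered=True)))
```

The three calls correspond to the three cases discussed in the text: OrderID becomes a required attribute, PreviousItem becomes a repeatable element, and CustomerName stays an element because of its order annotation.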

Because XML Schema will not let us directly specify n-ary relationship sets 
(n > 2), we convert them all to binary relationship sets by introducing a tuple 
identifier. We can think of each diamond in a C-XML diagram as being replaced 
by a nonlexical object set containing these tuple identifiers. To obtain a name 
for the object set containing the tuple identifiers, we concatenate names of 
non-functionally dependent object sets. For example, given the n-ary relationship set 
for Order, Item, SalePrice, and Qty, we generate an OrderItem element (Line 
63). If names become too long, we abbreviate using only the first letter of some 
object-set names. Thus, for example, we generate ItemMR (Line 11) for the 
relationship set connecting Item, Manufacturer, RequestDateTime, and Qty. 
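
The naming rule can be sketched as follows. The text does not fix a length threshold or say which names are abbreviated, so both the `max_len` parameter and the policy of keeping the first name and reducing the rest to initials are assumptions made here so that the two examples from the text come out as stated.

```python
def tuple_identifier_name(names, max_len=12):
    """Name the tuple-identifier object set for an n-ary relationship set.

    `names` lists the non-functionally dependent object sets in order.
    """
    full = "".join(names)
    if len(full) <= max_len:
        return full
    # Assumed abbreviation policy: keep the first name, reduce the rest to initials.
    return names[0] + "".join(name[0] for name in names[1:])


print(tuple_identifier_name(["Order", "Item"]))
print(tuple_identifier_name(["Item", "Manufacturer", "RequestDateTime"]))
```

SalePrice and Qty are left out of both argument lists because they are functionally dependent on the other participants.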

When a lexical object set has a one-to-one relationship with a nonlexical 
object set, we use the lexical object set as a surrogate for the nonlexical object 
set and generate a key constraint. In our example, this generates key constraints 
for Order/OrderID in Line 35 and Item/ItemNr in Line 39. We also use these 
surrogate identifiers, as needed, to maintain explicit referential integrity. Observe 
that in the scheme trees above, Item in the first tree references Item in the root 
of the second scheme tree and also that PreviousItem in the second scheme tree 
is a role and therefore a specific specialization (or subset) of Item in the root. 
Thus, we generate keyref constraints, one in Lines 69-72 to ensure the referential 
integrity of ItemNr in the OrderItem element and another in Lines 22-25 for 
the PreviousItem element. 
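
To make the generated identity constraints concrete, the following sketch builds an xs:key and an xs:keyref with Python's standard `xml.etree.ElementTree` module. The helper `identity_constraint` is a hypothetical name; the constraint names, selectors, and fields are the ones shown in Figure 2.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)


def identity_constraint(kind, name, selector, field, refer=None):
    """Build an xs:key or xs:keyref element with one selector and one field."""
    elem = ET.Element(f"{{{XS}}}{kind}", {"name": name})
    if refer is not None:
        # A keyref names the key whose values it must match.
        elem.set("refer", refer)
    ET.SubElement(elem, f"{{{XS}}}selector", {"xpath": selector})
    ET.SubElement(elem, f"{{{XS}}}field", {"xpath": field})
    return elem


# Key for the surrogate identifier ItemNr, and the keyref used by OrderItem.
item_key = identity_constraint("key", "ItemKey", ".//Item", "@ItemNr")
order_item_ref = identity_constraint("keyref", "r3", ".", "@ItemNr", refer="ItemKey")

print(ET.tostring(item_key, encoding="unicode"))
print(ET.tostring(order_item_ref, encoding="unicode"))
```

The key makes ItemNr a surrogate for Item, and the keyref forces every ItemNr inside an OrderItem element to match some existing Item.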

Another construct in C-XML we need to translate is generalization/specialization. 
XML Schema uses the concept of substitution groups to allow the use of 
multiple element types in a given context. Thus, for example, we generate an 
abstract element for Customer in Line 44, but then specify in Lines 45-55 a 
substitution group for Customer that allows RegularCustomer and PreferredCustomer 
to appear in a Customer context. We model content that would normally 
be associated with the generalization by generating a group that is referenced in 
each specialization (in Lines 47 and 52). In our example, we generate the group 
CustomerDetails and nest the details of Customer such as CustomerName, 
CustomerAddr, and Orders under CustomerDetails as we do beginning in Line 56. 
Further, we can nest any information that only applies to one of the 
specializations directly with that specialization; thus, in Line 48 we nest Discount under 
PreferredCustomer. 

Finally, XML documents need to have a single content root node. Thus, we 
assume the existence of an element called Document (Line 4) that serves as the 
universal content root. 

3.2 Translation from XML Schema to C-XML 

We translate XML Schema instances to C-XML by separating structural XML 
Schema concepts (such as elements and attributes) from non-structural XML 
Schema concepts (such as attribute types and order constraints). Then we gen- 
erate C-XML constructs for the structural concepts and annotate generated 
C-XML object sets with the non-structural information. 

We can convert an XML Schema S to a C-XML model instance $C_S$ by generating 
object sets for each element and attribute type connected by relationship 
sets according to the nesting structure of S. Figure 3 shows the result of applying 
our conversion process to the XML Schema instance of Figure 2. Note that we 
nest object and relationship sets inside one another corresponding to the nested 
element structure of the XML Schema instance. Whether we display C-XML ob- 
ject sets inside or outside one another has no semantic significance. The nested 
structure, however, is convenient because it corresponds to the natural XML 
Schema instance structure. 

The initial set of generated object and relationship sets is straightforward. 
Each element or attribute generates exactly one object set, and each element 
that is nested inside another element generates a relationship set connecting 
the two. Each attribute associated with an element e always generates a cor- 
responding object set a and a relationship set r connecting a to the object set 
generated by e. Participation constraints for attribute-generated relationship 
sets are always 1..* on the a side and are either 1 or 0..1 on the e side. Partic- 
ipation constraints for relationship sets generated by element nesting require a 
bit more work. If the element is in a sequence or a choice, there may be specific 
minimum/maximum occurrence constraints we can use directly. For example, 
according to the constraints on Line 60 in Figure 2 a CustomerDetails element 
may contain a list of 0 or more Order elements. However, an Order element must 
be nested inside a CustomerDetails element. Thus, for the relationship set con- 
necting CustomerDetails and Order, we place participation constraints of 0..* 
on the CustomerDetails side, and 1 on the Order side. 
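
The mapping from occurrence bounds to participation constraints can be sketched as below. The function names are hypothetical; the default values of 1 mirror the XML Schema defaults for minOccurs and maxOccurs, and the attribute case follows the rule stated above (1 if the attribute is required, 0..1 otherwise).

```python
def participation(min_occurs=1, max_occurs=1):
    """Map XML Schema occurrence bounds to a C-XML participation constraint."""
    upper = "*" if max_occurs == "unbounded" else str(max_occurs)
    lower = str(min_occurs)
    # Collapse equal bounds such as 1..1 into the single value 1.
    return lower if lower == upper else f"{lower}..{upper}"


def attribute_participation(use):
    """Participation on the element side of an attribute-generated relationship set."""
    return "1" if use == "required" else "0..1"


print(participation(0, "unbounded"))   # Order nested in CustomerDetails (Line 60)
print(participation(0, 5))             # ItemMR nested in Item (Line 11)
print(attribute_participation("required"))
```

The constraint on the nested element's own side (1 for Order, since every Order must sit inside a CustomerDetails element) comes from the nesting itself rather than from the occurrence bounds.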

In order to make the generated C-XML model instance less redundant, we 
look for certain patterns and rewrite the generated model instance when 
appropriate. For example, since ItemNr has a key constraint, we infer that it is 
one-to-one with Item. Further, the keyref constraints on ItemNr for PreviousItem and 
OrderItem indicate that rather than create two additional ItemNr object sets, 
we can instead relate PreviousItem and OrderItem to the ItemNr nested in Item. 

Fig. 3. C-XML Model Instance Translated from Figure 2. 



Another optimization is the treatment of substitution groups. In our example, 
since RegularCustomer and PreferredCustomer are substitutable for Customer, 
we construct a generalization/specialization for the three object sets and fac- 
tor out the common substructure of the specializations into the generalization. 
Thus, CustomerDetails exists in a one-to-one relationship with Customer. 

Another complication in XML Schema is the presence of anonymous types. 
For example, the complex type in Line 5 of Figure 2 is a choice of 0 or more 
Customer or Item elements. We need a generalization/specialization to represent 
this, and since C-XML requires names for object sets, we simply concatenate all 
the top-level names to form the generalization name CustomerItem. 

There are striking differences between the C-XML model instances of 
Figures 1 and 3. The translation to XML Schema introduced new elements 
Document, CustomerDetails, OrderItem, and ItemMR in order to represent a 
top-level root node, generalization/specializations, and decomposed n-ary 
relationship sets. If we knew that a particular XML Schema instance was generated from 
an original C-XML model instance, we could perform additional optimizations. 
For example, if we knew CustomerDetails was fabricated by the translation to 
XML Schema, we could observe that in the reverse translation to C-XML it 
is superfluous because it is one-to-one with Customer. Similarly, we could 
recognize that Document is a fabricated top-level element and omit it from the 
reverse translation; this would also eliminate the need for CustomerItem and its 
generalization/specialization. Finally, we could recognize that n-ary relationship 
sets have been decomposed, and in the reverse translation reconstitute them. 
The original C-XML to XML Schema translation could easily place annotation 
objects in the generated XML Schema instance marking elements for this sort 
of optimization. 

3.3 Information and Constraint Preservation 

To formalize information and constraint preservation for schema translations, we 
use first-order predicate calculus. We represent any schema specification in pred- 
icate calculus by generating an n-place predicate for each n-ary tuple container 
and a closed formula for each constraint [7]. Using the closed-world 
assumption, we can then populate the predicates to form an interpretation. If all the 
constraints hold over the populated predicates, the interpretation is valid. 
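
A toy rendering of this idea, offered only as a sketch: predicates are Python sets of tuples, the closed-world assumption means that any tuple not listed is false, and each closed formula is a Boolean function over the interpretation. The predicate names and the two constraints are hypothetical examples modeled on Order and OrderID.

```python
def is_valid(interpretation, constraints):
    """An interpretation is valid if every closed constraint holds over it."""
    return all(check(interpretation) for check in constraints)


def order_has_exactly_one_id(i):
    # Mandatory, functional connection: each Order has exactly one OrderID.
    return all(
        sum(1 for (o, _) in i["Order_OrderID"] if o == order) == 1
        for (order,) in i["Order"]
    )


def order_id_identifies_order(i):
    # Key constraint: no OrderID is shared by two orders.
    ids = [oid for (_, oid) in i["Order_OrderID"]]
    return len(ids) == len(set(ids))


constraints = [order_has_exactly_one_id, order_id_identifies_order]

# Populated predicates; tuples not listed are false (closed-world assumption).
good = {"Order": {("o1",), ("o2",)},
        "Order_OrderID": {("o1", 101), ("o2", 102)}}
bad = {"Order": {("o1",), ("o2",)},
       "Order_OrderID": {("o1", 101), ("o2", 101)}}

print(is_valid(good, constraints))
print(is_valid(bad, constraints))
```

The second interpretation fails because two orders share one OrderID, which violates the key constraint.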

For any schema specification $S_A$ of type A there is a corresponding valid 
interpretation $I_{S_A}$. We can guarantee that a translation T translates a schema 
specification $S_A$ to a constraint-equivalent schema specification $S_B$ by checking 
whether the constraints of the generated predicate calculus for the schema 
specification of type B imply the constraints of the generated predicate calculus for 
the schema specification of type A. A translation T that translates a schema 
specification $S_A$ into a schema specification $S_B$ induces a translation T' from an 
interpretation $I_{S_A}$ for a schema of type A to an interpretation $I_{S_B}$ for a schema of 
type B. We can guarantee that a T-induced translation T' translates any valid 
interpretation $I_{S_A}$ into an information-equivalent valid interpretation $I_{S_B}$ by 
translating both of the corresponding valid interpretations to predicate calculus 
interpretations $I^{PC}_{S_A}$ and $I^{PC}_{S_B}$ and checking for information equivalence. 

Definition 1. A translation T from schema specification $S_A$ to a schema 
specification $S_B$ preserves information if there exists a procedure P that for any valid 
interpretation $I_{S_A}$ corresponding to $S_A$ computes $I_{S_A}$ from $I_{S_B}$, where $I_{S_B}$ is the 
interpretation corresponding to $S_B$ induced by T. □ 

Definition 2. A translation T from schema specification $S_A$ to a schema 
specification $S_B$ preserves constraints if the constraints of $S_B$ imply the constraints 
of $S_A$. □ 

Lemma 1. Let $I_{S_{\text{C-XML}}}$ be a valid interpretation for a populated C-XML model 
instance $S_{\text{C-XML}}$. There exists a translation $T_{\text{C-XML}}$ that correctly represents 
$I_{S_{\text{C-XML}}}$ as a valid interpretation $I^{PC}_{S_{\text{C-XML}}}$ in predicate calculus.¹ 

¹ Due to space constraints, we have omitted all proofs in this paper. 

Lemma 2. Let $I_{S_{\text{XMLSchema}}}$ be an XML document that conforms to an XML 
Schema instance $S_{\text{XMLSchema}}$. There exists a translation $T_{\text{XMLSchema}}$ that 
correctly represents $I_{S_{\text{XMLSchema}}}$ as a valid interpretation $I^{PC}_{S_{\text{XMLSchema}}}$ in predicate 
calculus. 

Theorem 1. Let T be the translation described in Section 3.1 that translates a 
C-XML model instance $S_{\text{C-XML}}$ to an XML Schema instance $S_{\text{XMLSchema}}$. T 
preserves information and constraints. 

Theorem 2. Let T be the translation described in Section 3.2 that translates an 
XML Schema instance $S_{\text{XMLSchema}}$ to a C-XML model instance $S_{\text{C-XML}}$. T 
preserves information and constraints. 

4 C-XML Views 

This section describes three types of views: simple views that help us scale up 
to large and complex XML schemas, query-generated views over a single XML 
schema, and query-generated views over heterogeneous XML schemas. 

4.1 High-Level Abstractions in C-XML 

We create simple views in two ways. Our first way is to nest and hide C-XML 
components inside one another [7]. Figure 3 shows how we can nest object sets 
inside one another. We can pull any object set inside any other connected object 
set, and we can pull any object set inside any connected relationship set so long 
as we leave at least two object sets outside (e.g. in Figure 1 we can pull Qty 
and/or SalePrice inside the diamond). Whether an object set appears on the 
inside or outside has no effect on the meaning. Once we have object sets on the 
inside, we can implode the object set or relationship set and thus remove the 
inner object sets from the view. We can, for example, implode Customer, Item, 
and PreferredCustomer in Figure 3, presenting a much simpler diagram showing 
only five object sets and two generalization/specialization components nested in 
Document. To denote an imploded object or relationship set, we shade the object 
set or the relationship-set diamond. Later, we can explode object or relationship 
sets and view all details. Since we allow arbitrary nesting, it is possible that 
relationship-set lines may cross object- or relationship-set boundaries. In this 
case, when we implode, we connect the line to the imploded object or relationship 
set and make the line dashed to indicate that the connection is to an interior 
object set. 

Our second way to create simple views is to discard C-XML components 
that are not of interest. We can discard any relationship set, and we can discard 
all but any two connections of an n-ary relationship set (n > 2). We can also 
discard any object set, but then must discard (1) any connecting binary rela- 
tionship sets, (2) any connections to n-ary relationship sets (n > 2), and (3) any 
specializations and relationship sets or relationship-set connections to these 
specializations. Figure 4 shows an example of a high-level abstraction of Figure 1. In 
Figure 4 we have discarded Price and its associated binary relationship set, the 
relationship set for PreviousItem, and the connections to RequestDateTime and 
Qty in the n-ary relationship set involving Manufacturer. We have also hidden 
OrderID, OrderDate, and all customer information except CustomerName inside 
Order, and we have hidden SalePrice and Qty inside the Order-Item 
relationship set. Note that both the Order object set and the Order-Item relationship 
set are shaded, indicating the inclusion of C-XML components; that neither the 
Item object set nor the Item-Manufacturer relationship set is shaded, 
indicating that the original connecting information has been discarded rather than 
hidden within; and that the line between CustomerName and Order is dashed, 
indicating that CustomerName connects, not to Order directly, but rather to an 
object set inside Order. 

Fig. 4. High-Level View of Customer/Order C-XML Model Instance. 
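
The first two discard rules for object sets can be sketched over a very simple representation in which a relationship set is a tuple of the object-set names it connects. This representation and the function name `discard_object_set` are assumptions made for illustration; rule (3), concerning specializations, is not modeled.

```python
def discard_object_set(object_sets, relationship_sets, victim):
    """Drop `victim`, its binary relationship sets, and its n-ary connections."""
    kept_objects = [o for o in object_sets if o != victim]
    kept_relationships = []
    for members in relationship_sets:
        if victim not in members:
            kept_relationships.append(members)
        elif len(members) > 2:
            # n-ary (n > 2): remove only the connection to the discarded object set.
            kept_relationships.append(tuple(m for m in members if m != victim))
        # A binary relationship set touching the victim disappears entirely.
    return kept_objects, kept_relationships


objects = ["Item", "Price", "Manufacturer", "RequestDateTime", "Qty"]
relationships = [("Item", "Price"),
                 ("Item", "Manufacturer", "RequestDateTime", "Qty")]

# Discarding Price removes its binary relationship set with Item.
print(discard_object_set(objects, relationships, "Price"))
# Discarding RequestDateTime only shrinks the n-ary relationship set.
print(discard_object_set(objects, relationships, "RequestDateTime"))
```

Because every surviving relationship set still connects at least two object sets, the result remains a well-formed model instance in this simplified representation.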

Theorem 3. Simple, high-level views constructed by properly discarding C-XML 
components are valid C-XML model instances. 

Corollary 1. Any simple, high-level view can be represented by an XML Schema. 

4.2 C-XML XQuery Views 

We now consider the use of C-XML views to generate XQuery views. As other 
researchers have pointed out [2,5], XQuery can be hard for users to understand 
and manipulate. One reason XQuery can be cumbersome is because it must 
follow the particular hierarchical structure of an underlying XML schema, rather 
than the simpler, logical structure of an underlying conceptual model. Further, 
different XML sources might specify conflicting hierarchical representations of 
the same conceptual relationship [2]. Thus, it is highly desirable to be able to 
construct XQuery views by generating them from a high-level conceptual model- 
based description. [5] describes an algorithm for generating XQuery views from 
ORA-SS descriptions. [2] also describes how to specify XQuery views by writing 
conceptual XPath expressions over a conceptual schema and then automatically 
generating the corresponding XQuery specifications. In a similar fashion, we can 
generate XQuery views directly from high-level C-XML views. In some situations 
a graphical query language would be an excellent choice for creating C-XML 
views [9], but in keeping with the spirit of C-XML we define an XQuery-like 
textual language called C-XQuery. 

define view CustomersByItemsOrdered 
{ for $item in Item 
  return 
    <Item> 
      {$item/ItemNr, $item/Description} 
      { for $customer in $item/Order/Customer 
        return 
          <Customer> 
            {$customer/CustomerName, $customer/CustomerAddr} 
            { for $order in $customer/Order, 
                  $item2 in $order/Item 
              where $item2 = $item 
              return 
                <Order> 
                  {$order/OrderDate, $item2/Qty, $item2/SalePrice} 
                </Order> 
            } 
          </Customer> 
      } 
    </Item> 
} 

Fig. 5. C-XQuery View of Customers Nested within Items Ordered. 

Figure 5 shows a high-level view written in C-XQuery over the model instance 
of Figure 1. We introduce a view definition with the phrase define view, and 
specify the contents of the view with FLWOR (for, let, where, order by, return) 
expressions [14]. The first for $item in Item phrase creates an iterator over 
objects in the Item object set. Since there is no top-level where clause, we iterate 
over all the items. Also, since C-XML model instances do not have “root nodes”, 
the idea of context is different. In this case, Item defines the Item object set as 
the context of the path expression. For each such item, we return an <Item> ... 
</Item> structure populated according to the nested expressions. 

C-XQuery is much like ordinary XQuery, with the main distinguishing factor 
that our path expressions are conceptual, and so, for example, they are not 
concerned with the distinction between attributes and elements. Note particularly 
that for the data fields, such as ItemNr, CustomerName, and OrderDate, we 
do not care whether the generated XML treats them as attributes or elements. 
A more subtle characteristic of our conceptual path expressions is that since 
they operate over a flat C-XML structure, we can traverse the conceptual-model 
graph more flexibly, without regard for hierarchical structure. Thus, we general- 
ize the notion of a path expression so that the expression A//B designates the 
path from A to B regardless of hierarchy or the number of intervening steps in 
the path [9]. This can lead to ambiguity in the presence of cycles or multiple 
paths between nodes, but we can automatically detect ambiguity and require the 
user to disambiguate the expression (say, by designating an intermediate node 
that fixes a unique path). 
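
The ambiguity check for a generalized path expression A//B can be sketched as a search for simple paths in the conceptual-model graph. The adjacency list below is a toy graph, not a faithful encoding of Figure 1, and the function names and the `via` parameter are hypothetical.

```python
def simple_paths(graph, start, goal, path=None):
    """Enumerate all simple (cycle-free) paths from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    found = []
    for neighbor in graph.get(start, []):
        if neighbor not in path:
            found.extend(simple_paths(graph, neighbor, goal, path))
    return found


def resolve(graph, a, b, via=None):
    """Resolve A//B; `via` optionally names an intermediate node to disambiguate."""
    paths = simple_paths(graph, a, b)
    if via is not None:
        paths = [p for p in paths if via in p]
    if len(paths) != 1:
        raise ValueError(f"{a}//{b} matches {len(paths)} paths; "
                         "name an intermediate node")
    return paths[0]


# Toy undirected graph: Qty is reachable from Item directly and via Manufacturer.
graph = {
    "Customer": ["Order"],
    "Order": ["Customer", "Item"],
    "Item": ["Order", "Manufacturer", "Qty"],
    "Manufacturer": ["Item", "Qty"],
    "Qty": ["Item", "Manufacturer"],
}

print(resolve(graph, "Customer", "Item"))
print(resolve(graph, "Customer", "Qty", via="Manufacturer"))
```

Calling `resolve(graph, "Customer", "Qty")` without `via` raises an error, because two distinct paths reach Qty; this is the situation in which the user would be asked to disambiguate.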
define view RecentNitrogenFertilizerCustomers 
{ for $i in CustomersByItemsOrdered/Item 
  where $i/Description = "Nitrogen Fertilizer" 
  return 
    <Customer> 
      { for $c in $i/Customer 
        let $total := sum( for $o in $c/Order 
                           where $o/OrderDate > add-days(current-date(), -90) 
                           return $o/Qty * $o/SalePrice ) 
        return 
          { $c/CustomerName, Total=$total } 
      } 
    </Customer> 
} 

for $c in RecentNitrogenFertilizerCustomers/Customer 
where $c/Total > 300 
return 
  <PotentialThreatCustomer> 
    { $c/CustomerName, $c/Total } 
  </PotentialThreatCustomer> 

Fig. 6. C-XQuery over the View of Customers Nested within Items Ordered. 



Given a view definition, we can write queries against the view. For the view in 
Figure 5, for example, the query in Figure 6 finds customers who have purchased 
more than $300 worth of nitrogen fertilizer within the last 90 days. To execute 
the query, we unfold the view according to the view definition and minimize 
the resulting XQuery. See [13] for a discussion of the underlying principles. 

The view in Figure 6 illustrates the use of views over views. Indeed, appli- 
cations can use views as first-class data sources, just like ordinary sources, and 
we can write queries against the conceptual model and views over that model. 
In any case, we translate the conceptual queries to XQuery specifications over 
the XML Schema instance generated for the C-XML conceptual model. 

Theorem 4. A C-XQuery view Q over a C-XML model instance C can be 
translated to an XQuery query $Q_C$ over an XML Schema instance $S_C$. 

Observe that by the definition of XQuery [14], any valid XQuery instance 
generates an underlying XML Schema instance. By Theorem 4, we thus know 
that for any C-XQuery view we retain a correspondence to XML Schema. In 
particular, this means we can compose views of views to an arbitrary depth and 
still retain a correspondence to XML Schema. 

4.3 XQuery Integration Mappings 

To motivate the use of views in enterprise conceptual modeling, suppose through 
mergers and acquisitions we acquire the catalog inventory of another company. 
Figure 7 shows the C-XML for this assumed catalog. We can rapidly integrate 
this catalog into the full inventory of the parent company by creating a mapping 
from the acquired company’s catalog to the parent company’s catalog. Figure 8 
shows such a mapping. In order to integrate the source (Figure 7) with the 
target (Figure 1), the mapping needs to generate target names in the source. In 
this example, CatalogItem, CatalogNr, and ShortName correspond respectively 
to Item, ItemNr, and Description. We must compute Price in the target from 
the MSRP and MarkupPercent values in the source, as Figure 8 shows. We 
assume the function CatalogNr-to-ItemNr is either a hand-coded lookup table or 
a manually programmed function that translates source catalog numbers to item 
numbers in the target. The underlying structure of this mapping query 
corresponds directly to the relevant section of the C-XML model instance in Figure 1, 
so integration is now immediate. 

Fig. 7. C-XML Model Instance for the Catalog of an Acquired Company. 

define view CatalogItemToItem 
{ for $cItem in CatalogItem 
  let $itemNr := CatalogNr-to-ItemNr($cItem) 
  let $price := $cItem/MSRP * (1 + $cItem/MarkupPercent) 
  return 
    <Item> 
      <ItemNr>{$itemNr}</ItemNr> 
      <Description>{$cItem/ShortName}</Description> 
      <Price>{$price}</Price> 
    </Item> 
} 

Fig. 8. C-XQuery Mapping for Catalog Integration. 
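
The mapping of Figure 8 can be mirrored in a few lines of ordinary code, which may help when reading the view. This is only a sketch: the lookup table, the catalog number "C-17", the item number 4021, and the sample values are all invented, and MarkupPercent is treated as a fraction (0.25 for 25%) because that is what the formula in Figure 8 implies.

```python
from decimal import Decimal

# Hypothetical lookup table standing in for CatalogNr-to-ItemNr.
CATALOG_TO_ITEM = {"C-17": 4021}


def catalog_nr_to_item_nr(c_item):
    """Translate a source catalog item to its target item number."""
    return CATALOG_TO_ITEM[c_item["CatalogNr"]]


def catalog_item_to_item(c_item):
    """Mirror Figure 8: rename fields and derive Price from MSRP and markup."""
    return {
        "ItemNr": catalog_nr_to_item_nr(c_item),
        "Description": c_item["ShortName"],
        "Price": c_item["MSRP"] * (1 + c_item["MarkupPercent"]),
    }


source = {"CatalogNr": "C-17", "ShortName": "Nitrogen Fertilizer",
          "MSRP": Decimal("100.00"), "MarkupPercent": Decimal("0.25")}
print(catalog_item_to_item(source))
```

The output record has exactly the target names Item uses in Figure 1, which is what allows queries written against the target to run over the acquired catalog.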

The mapping in Figure 8 creates a target-compatible C-XQuery view over 
the acquired company’s catalog in Figure 7. When we now query the parent com- 
pany’s items, we also query the acquired company’s catalog. Thus, the previous 
examples are immediately applicable. For example, we can find those customers 
who have ordered more than $300 worth of nitrogen fertilizer from either the 
inventory of the parent company or the inventory of the acquired company by 
simply issuing the query in Figure 6. With the acquired company’s catalog inte- 
grated, when the query in Figure 6 iterates over customer orders, it iterates over 
data instances for both Item in Figure 1 and CatalogItem in Figure 8. (Now, if 
the potential terrorist has purchased, say $200 worth of nitrogen fertilizer from 
the original company and $150 worth from the acquired company, the poten- 
tial terrorist will appear on the list, whereas the potential terrorist would have 
appeared on neither list before.) 
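
The effect described in the parenthetical can be checked with a small calculation. The sketch below assumes invented order records (all for nitrogen fertilizer), an invented reference date, and hypothetical field names; it applies the 90-day window and the $300 threshold of Figure 6 to each source separately and then to the integrated data.

```python
from datetime import date, timedelta


def flagged_customers(orders, today, threshold=300, window_days=90):
    """Return customers whose recent order value exceeds the threshold."""
    cutoff = today - timedelta(days=window_days)
    totals = {}
    for order in orders:
        if order["date"] > cutoff:
            value = order["qty"] * order["sale_price"]
            totals[order["customer"]] = totals.get(order["customer"], 0) + value
    return sorted(c for c, total in totals.items() if total > threshold)


today = date(2004, 11, 8)
parent_orders = [{"customer": "C1", "date": date(2004, 10, 1),
                  "qty": 4, "sale_price": 50}]    # $200 from the parent company
acquired_orders = [{"customer": "C1", "date": date(2004, 10, 15),
                    "qty": 3, "sale_price": 50}]  # $150 from the acquired company

print(flagged_customers(parent_orders, today))
print(flagged_customers(acquired_orders, today))
print(flagged_customers(parent_orders + acquired_orders, today))
```

Neither source alone crosses the threshold, but the combined total of $350 does, which is the situation the integrated query is meant to catch.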

We could also write a mapping query going in the opposite direction, with 
Figure 1 as the source and Figure 7 as the target. Such bidirectional integration 
is useful in circumstances where we need to shift between perspectives, as is often 
the case in enterprise application development. This is especially true because 
all enterprise data is rarely fully integrated. 

In general it would be nice to have a mostly automated tool for generating 
integration mappings. In order to support such a tool, we require two-way 
mappings between both schemas and data elements. Sometimes we can use au- 
tomated element matchers [1, 12] to help us with the mapping. However, in other 
cases the mappings are intricate and require programmer intervention (e.g. cal- 
culating Price from MSRP plus a MarkupPercent or converting CatalogNr to 
ItemNr). In any case, we can write C-XQuery views describing each such map- 
ping, with or without the aid of tools (e.g. [11]), and we can compose these 
views to provide larger C-XQuery schema mappings. Of course there are many 
integration details we do not address here, such as handling dirty data, but the 
approach of integrating by composing C-XQuery views is sound. 

Theorem 5. A C-XQuery view Q over a C-XML model instance C of an 
external, federated XML Schema can be translated to an XQuery query $Q_C$ over 
an XML Schema instance $S_C$. 

5 Concluding Remarks 

We have offered Conceptual-XML (C-XML) as an answer to the challenge of 
modern enterprise modeling. C-XML is equivalent in expressive power to XML 
Schema (Theorems 1 and 2). In contrast to XML Schema, however, C-XML pro- 
vides for high level conceptualization of an enterprise. C-XML allows users to 
view schemas at any level of abstraction and at various levels of abstraction in 
the same specification (Theorem 3), which goes a long way toward mitigating 
the complexity of large data sets and complex interrelationships. Along with C- 
XML, we have provided C-XQuery, a conceptualization of XQuery that relieves 
programmers from concerns about the often arbitrary choice of nesting and ar- 
bitrary choice of whether to represent values with attributes or with elements. 
Using C-XQuery, we have shown how to define views and automatically trans- 
late them to XQuery (Theorem 4). We have also shown how to accommodate 
heterogeneity by defining mapping views over federated data repositories and 
automatically translate them to XQuery (Theorem 5). 

Implementing C-XML is a huge undertaking. Fortunately, we have a 
foundation on which to build. The tools we have already implemented that are relevant 
to C-XML include graphical diagram editors, model checkers, textual model 
compilers, a model execution engine, and several data integration tools. We are actively 
continuing development of an Integrated Development Environment (IDE) for 
modeling-related activities. Our strategy is to plug new tools into this IDE rather 
than develop stand-alone programs. Our most recent implementation work con- 
sists of tools for automatic generation of XML normal form schemes. We are 
now working on the implementation of the algorithms to translate C-XML to 
XML Schema, XML Schema to C-XML, and C-XQuery to XQuery. 

Abstract. Reasoning on constraint sets is a difficult task. Classical 
database design is based on a step-wise extension of the constraint set 
and on a consideration of constraint sets through generation by tools. 
Since the database developer must master semantics acquisition, tools 
and approaches are still sought that support reasoning on sets of con- 
straints. We propose novel approaches for presentation of sets of func- 
tional dependencies based on specific graphs. These approaches may be 
used for the elicitation of the full knowledge on validity of functional 
dependencies in relational schemata. 



1 Design Problems During Database Semantics 
Specification and Their Solution 

Specification of database structuring is based on three interleaved and dependent 
parts [9]: 

Syntactics: Inductive specification of structures uses a set of base types, a 
collection of constructors and a theory of construction limiting the application 
of constructors by rules or by formulas in deontic logics. In most cases, the 
theory may be dismissed. 

Semantics: Specification of admissible databases on the basis of static integrity 
constraints describes those database states which are considered to be legal. 

Pragmatics: Description of context and intension is based either on explicit ref- 
erence to the enterprise model, to enterprise tasks, to enterprise policy, and 
environments or on intensional logics used for relating the interpretation and 
meaning to users depending on time, location, and common sense. 



P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 166-179, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 
Specification of syntactics is based on the database modeling language. Specifi- 
cation of semantics requires a logical language for specification of classes of con- 
straints. Typical constraints are dependencies such as functional, multivalued, 
and inclusion dependencies, or domain constraints. Specification of pragmatics 
is often not explicit. The specification of semantics is often rather difficult due to 
the complexity. For this reason, it must be supported by a number of solutions 
supporting acquisition and reasoning on constraints. 



Prerequisites of Database Design Approaches 

Results obtained during database structuring are evaluated on two main criteria: 
completeness [8] and unambiguity of specification. 

Completeness requires that all constraints that must be specified are found. 
Unambiguity is necessary in order to provide a reasoning system. Both criteria 
have found their theoretical and pragmatical solution for most of the known 
classes of constraints. Completeness is, however, restricted by the human ability 
to survey large constraint sets and to understand all possible interactions among 
constraints. 

Theoretical Approaches to Problem Solution: A number of normalization and 
restructuring algorithms have been developed for functional dependencies. We 
do not yet know of simple representation systems for surveying constraint sets 
and for detecting missing constraints beyond functional dependencies. 
Pragmatical Approaches to Problem Solution: A step-wise constraint acquisition 
procedure has been developed in [7,10,12]. The approach is based on the 
separation of constraints into: 

The set of valid functional dependencies $\Sigma_1$: All dependencies that 
are known to be valid and all those that can be implied from the set 
of valid and excluded functional dependencies. 

The set of excluded functional dependencies $\Sigma_0$: All dependencies 
that are known to be invalid and all those that are invalid and can 
be implied from the set of valid and excluded functional dependencies. 

This approach leads to the following simple elicitation algorithm: 

1. Basic step: Design obvious constraints. 

2. Recursion step: Repeat until the constraint sets $\Sigma_0$ and $\Sigma_1$ do not change. 

- Find a functional dependency $\alpha$ that is neither in $\Sigma_1$ nor in $\Sigma_0$. 
  If $\alpha$ is valid then add $\alpha$ to $\Sigma_1$. If $\alpha$ is invalid then add $\alpha$ to $\Sigma_0$. 
- Generate the logical closures of $\Sigma_0$ and $\Sigma_1$. 

This algorithm can be refined in various ways. Elicitation algorithms known 
so far are all variations of this simple elicitation algorithm. 
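
The elicitation loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure of [7,10,12]: the names `closure`, `candidates`, `elicit` and the `ask` callback (standing in for the designer who judges validity) are hypothetical, dependencies are restricted to non-trivial singleton right sides, and an excluded dependency is treated as implied when assuming its validity would make some already excluded dependency derivable.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidates(attributes):
    """Yield every non-trivial singleton dependency X -> a with non-empty X."""
    attrs = sorted(attributes)
    for a in attrs:
        rest = [b for b in attrs if b != a]
        for size in range(1, len(rest) + 1):
            for lhs in combinations(rest, size):
                yield frozenset(lhs), a


def elicit(attributes, obvious, ask):
    """Return (valid set, excluded set, number of questions put to `ask`)."""
    sigma1 = [(frozenset(l), frozenset(r)) for l, r in obvious]  # basic step
    sigma0 = []
    questions = 0
    # One pass suffices here: a dependency that is implied once stays implied
    # as the two sets grow, so every candidate is either asked or implied.
    for lhs, a in candidates(attributes):
        if a in closure(lhs, sigma1):
            continue  # already implied as valid
        trial = sigma1 + [(lhs, frozenset([a]))]
        if any(b in closure(z, trial) for z, b in sigma0):
            continue  # already implied as excluded
        questions += 1
        if ask(lhs, a):
            sigma1.append((lhs, frozenset([a])))
        else:
            sigma0.append((lhs, a))
    return sigma1, sigma0, questions


# Hypothetical designer who answers according to a hidden set of true dependencies.
TRUTH = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
sigma1, sigma0, questions = elicit("ABC", [], lambda lhs, a: a in closure(lhs, TRUTH))
print(questions, "questions for", len(list(candidates("ABC"))), "candidates")
```

The order in which candidates are examined influences how many questions are needed, which is one of the places where the algorithm can be refined.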

However, neither the theoretical solutions nor the pragmatical approach provides 
a solution to problem 1: 

Define a pragmatical approach that allows simple representation of and reasoning 
on database constraints. 

This problem becomes more severe in association with the following problems. 
Complexity of Semantics 

Typical algorithms such as normalization algorithms can only generate a correct 
result if specification is complete. Such completeness is not harmful as long as 
constraint sets are small. The number of constraints may however be exponential 
in the number of attributes [3]. Therefore, specification of the complete set of 
functional dependencies may be a task that is infeasible. This problem is 
closely related to another well-known combinatoric problem presented by Janos 
Demetrovics during MFDBS’87 [11] and that is still only partially solved: 

Problem 2. What is the size of sets of independent functional dependencies for an 
n-ary relation schema? 



Inter-dependence Within a Constraint Set 

Constraints such as functional dependencies are not independent from each 
other. Typical axiomatizations use rules such as the union, transitivity and path 
rules. Developers do not reason this way. Therefore, the impact of adding, delet- 
ing or modifying a constraint within a constraint set is not easy to capture. 
Therefore, we need a system for reasoning on constraint sets. 

Theoretical Approaches to Problem Solution: [14] and [1] propose to use a graph- 
based representation of sets of functional dependencies. This solution pro- 
vides a simple survey as long as constraints are simple, i.e., use singleton sets 
for the left sides. [13] proposes to use a schema architecture by developing 
first elementary schema components and constructing the schema by appli- 
cation of composition operations which use these components. [4] propose to 
construct a collection of interrelated lattices of functional dependencies. Each 
lattice represents a component of [13]. The set of functional dependencies is 
then constructed through folding of the lattices. 

Pragmatical Approaches to Problem Solution: [6] proposes to use a fact-based 
approach instead of modeling of attributes. Elementary facts are ‘small’ objects 
that cannot be decomposed without losing meaning. 

We, thus, must solve problem 3. 

Develop a reasoning system that supports easy maintenance and development of 
constraint sets and highlights logical inter-dependence among constraints. 



Instability of Normalization 

Normalization is based on the completeness of constraint sets. This is impractical. 
Although database design tools can support completeness, incompleteness of 
specification should be considered the normal situation. Therefore, normalization 
approaches should be robust with regard to incompleteness. 

Problem 4. [12] Find a normalization theory which is robust for incomplete con- 
straint sets or robust according to a class of changes in constraint sets. 
Problems That Currently Defy Solution 

Dependency theory consists of work on about 95 different classes of dependencies, 
with very few classes that have been treated together. Moreover, properties of 
sets of functional dependencies still remain unknown. 

In most practical cases several negative results obtained in the dependency 
theory do not restrict the common utilization of several classes. The reason for 
this is that the used constraint sets do not have these properties. Therefore, we 
need other classification principles for describing ‘real life’ constraint sets. 

Problem 5. [12] Classify ‘real life’ constraint sets which can be easily maintained 
and specified. 

This problem is related to one of the oldest problems in database research 
expressed by Joachim Biskup in the open problems session [11] of MFDBS’87: 

Problem 6. Develop a specification method that supports consideration of sets of 
functional dependencies and derivation of properties of those sets. 



Outline of the Paper¹ and the Kernel Problem 
Behind Open Problems 

The six problems above can be solved on one common basis: 

Find a simple and sophisticated representation of sets of constraints 
that supports reasoning on constraints. 

This problem is infeasible in general. Therefore, we provide first a mechanism 
to reason on sets of functional dependencies defined on small sets of attributes. 
Geometrical figures such as polygons or tetrahedra nicely support reasoning on 
constraints. Next we demonstrate the representation for attribute sets consist- 
ing of three or four attributes. Finally we introduce the implication system for 
graphical representations and show how these representations lead to a very 
simple and sophisticated treatment of sets of functional dependencies. 

2 Sets of Functional Dependencies 
for Small Relation Schemata 

2.1 Universes of Functional Constraints 

Besides functional dependencies (FDs), we use excluded functional constraints 
(also called negated functional dependencies): e.g. $X \not\to Y$ states that the 
functional dependency $X \to Y$ is not valid. 

Treating sets of functional constraints becomes simpler if we avoid dealing 
with obviously redundant constraints. In our notation, a trivial constraint (a 
functional dependency or an excluded functional constraint) is a constraint with 
at least one attribute of its left-hand side and right-hand side in common or 
has the empty set as its right-hand side. Furthermore, a canonical (singleton) 
functional dependency or a singleton excluded functional constraint has exactly 
one attribute on its right-hand side. We introduce the notations $\mathbb{D}$, $\mathbb{D}^+$ and $\mathbb{D}_c^+$ 
for the universes of functional dependencies, non-trivial functional dependencies 
and non-trivial canonical (singleton) functional dependencies, respectively, over 
a fixed underlying domain of attribute symbols. Similarly, $\mathbb{E}$, $\mathbb{E}^+$ and $\mathbb{E}_c^+$ denote 
the universes of excluded functional constraints, non-trivial excluded constraints 
and non-trivial singleton excluded functional constraints (negated non-trivial, 
canonical dependencies) over the same set of attribute symbols, respectively. The 
traditional universe of functional constraints (including functional dependencies 
and excluded constraints) is $\mathbb{D} \cup \mathbb{E}$, while our graphical representations deal with 
sets of constraints over $\mathbb{D}_c^+ \cup \mathbb{E}_c^+$. In other words, the graphical representations 
we present in this paper deal with non-trivial canonical functional dependencies 
and non-trivial singleton excluded functional constraints only. It will be shown 
that we do not lose relevant deductive power by applying this restriction to the 
universe of functional constraints. 

In most of the cases, we focus on closed sets of functional dependencies. A 
finite set $\mathcal{F} \subseteq \mathbb{D}_c^+$ is closed (over $\mathbb{D}_c^+$) iff $\mathcal{F}^+ = \mathcal{F}$, where $\mathcal{F}^+$ is the closure of $\mathcal{F}$, 
i.e. $\mathcal{F}^+ = \{\delta \in \mathbb{D}_c^+ \mid \mathcal{F} \models \delta\}$. 

¹ Due to the lack of space, this paper does not contain proofs or the representation 
of all possible sets of functional dependencies. All technical details as well as some 
other means of representation can be read in a technical report available under [5]. 
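
As a small illustration of these notions, the following Python sketch tests triviality in the sense used here and splits a functional dependency into non-trivial singleton dependencies. The function names are hypothetical, attributes are single characters for brevity, and the splitting applies to functional dependencies only, not to excluded constraints.

```python
def is_trivial(lhs, rhs):
    """Trivial in the sense of this section: empty right side or a shared attribute."""
    lhs, rhs = set(lhs), set(rhs)
    return not rhs or bool(lhs & rhs)


def canonical(lhs, rhs):
    """Split the functional dependency lhs -> rhs into non-trivial singleton dependencies."""
    lhs = frozenset(lhs)
    return {(lhs, a) for a in set(rhs) - lhs}


print(is_trivial("AB", "BC"))       # shares the attribute B
print(sorted(canonical("A", "BC"), key=lambda fd: fd[1]))
```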

2.2 The Notion of Dimension 

For the classification of functional constraints and the attributes they refer to, we 
introduce the notion of dimension first. The dimension of a constraint is simply 
the size of its left-hand side, i.e. the number of attributes on its left-hand side. 

For a functional dependency $X \to A \in \mathbb{D}_c^+$ denote by $[X \to A]$ its dimension, 
defined as 

$$[X \to A] \stackrel{\mathrm{def}}{=} |X|$$ 

(the dimension of an excluded functional constraint can be defined similarly). 
For a single attribute $A$, given a set of functional dependencies $\mathcal{F} \subseteq \mathbb{D}_c^+$, the 
dimension of $A$ is denoted by $[A]_{\mathcal{F}}$ (or just simply $[A]$) and defined as 

$$[A]_{\mathcal{F}} \stackrel{\mathrm{def}}{=} \min_{X \to A \,\in\, \mathcal{F}^+} |X|$$ 

This definition is extended with $[A]_{\mathcal{F}} \stackrel{\mathrm{def}}{=} \infty$ for the case when no $X \to A$ exists 
in $\mathcal{F}^+$. The dimensions of attributes classify the sets of functional dependencies. 
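
The dimension of an attribute can be computed by brute force for small schemata. The sketch below is illustrative only: `closure` and `dimension` are hypothetical names, and membership of $X \to A$ in the closure is tested through the usual attribute closure of $X$.

```python
import math
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def dimension(attr, attributes, fds):
    """Smallest |X| such that X -> attr is implied and attr is not in X, else infinity."""
    others = sorted(set(attributes) - {attr})
    for size in range(len(others) + 1):
        for lhs in combinations(others, size):
            if attr in closure(lhs, fds):
                return size
    return math.inf


FDS = [({"A"}, {"B"}), ({"B"}, {"C"})]
print({a: dimension(a, "ABC", FDS) for a in "ABC"})
```

For the chain $A \to B$, $B \to C$ the attributes $B$ and $C$ have dimension 1, while nothing determines $A$.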

2.3 Summary of the Number of Closed Sets 

Let $n$ be the number of attributes of the considered relation schema. Denote 
by $\mathbb{SD}_n$ the set of closed sets of (singleton, non-trivial) functional dependencies 
for this $n$ (with constant attributes disallowed). Defining $r$ as the equivalence 
relation on these sets classifying them into different types or cases (for two 
equivalent sets there exists a permutation of attributes transforming one set to 
another), the set of different classes is $\mathbb{SD}_n/r$. We are focusing on these different 
classes and the size of this set. Another possibility is to allow attributes to 
be stated as constants. Performing this extension to $\mathbb{SD}_n$ we get a larger set, 
denoted by $\mathbb{SD}_n^0$. The different cases (types) of functional dependency sets taking 
these zero-dimensional constraints into account form the set $\mathbb{SD}_n^0/r$. It can be 
easily verified that $|\mathbb{SD}_{n+1}^0/r| = |\mathbb{SD}_{n+1}/r| + |\mathbb{SD}_n^0/r|$ holds for each $n \in \mathbb{N}^+$. 

With these notations, Table 1 shows the number of closed sets of functional 
dependencies for unary, binary, ternary, quaternary and quinary relational 
schemata and demonstrates the combinatorial explosion of the search space. 



Table 1. Number of closed sets of functional dependencies for n attributes 

n    $|\mathbb{SD}_n/r|$    $|\mathbb{SD}_n^0|$    $|\mathbb{SD}_n^0/r|$ 
4    165          2 480         184 
5    14 480       1 385 552     14 664 
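
For very small $n$ such counts can be reproduced by brute force. The sketch below is not the authors' program; it relies on the standard correspondence, assumed here and not spelled out in the paper, between closed sets of functional dependencies and families of attribute sets that contain the full attribute set and are closed under intersection. Disallowing constant attributes corresponds to requiring that the empty set belongs to the family. The function name is hypothetical and attribute sets are encoded as bitmasks.

```python
def count_closed_sets(n, constants_allowed):
    """Count intersection-closed families of subsets of an n-element attribute set."""
    full = (1 << n) - 1
    # The full set is always a member; the empty set is forced in when
    # constant attributes are disallowed.
    optional = [s for s in range(full) if constants_allowed or s != 0]
    count = 0
    for pick in range(1 << len(optional)):
        family = {full}
        if not constants_allowed:
            family.add(0)
        family.update(s for i, s in enumerate(optional) if pick >> i & 1)
        if all((x & y) in family for x in family for y in family):
            count += 1
    return count


for n in range(1, 5):
    print(n, count_closed_sets(n, False), count_closed_sets(n, True))
```

For $n = 4$ with constants allowed the enumeration yields 2 480, the value listed in Table 1. Already for $n = 5$ there are $2^{31}$ candidate families, so this naive enumeration stops being practical, which illustrates the growth of the search space.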



3 The Graphical Representation of Sets 
of Functional Dependencies 

There have been several proposals for graphical representation of sets of 
functional dependencies. Well-known books such as [1] and [14] have used a 
graph-theoretic notion. Nevertheless, these graphical notations have not made their 
way into practice and education. The main reason for this failure is the 
complexity of representation. Graphical representations are simple as long as the set 
of functional dependencies is not too complex. [2] has proposed a representation 
for the ternary case based on either assigning an N-notation if nothing is 
known or assigning a 1-notation to an edge from $X$ to $Y$ at the $Y$ end if the 
functional dependency $X \to Y$ is valid. This representation is simple enough but 
already redundant in the case of ternary relationship types. Moreover, it is not 
generalizable to cases of n-ary relationship types with $n > 3$. 

We use a simpler notation which reflects the validity of functional dependencies 
in a simpler and more understandable fashion. 

We distinguish two kinds of functional dependencies for n = 3: 

One-dimensional (singleton left sides): Functional dependencies of the 
form $\{A\} \to \{B, C\}$ can be decomposed to canonical functional dependencies 
$\{A\} \to \{B\}$ and $\{A\} \to \{C\}$. They are represented by endpoints of 
binary edges (1D shapes) in the triangular representation. 
Two-dimensional (two-element left sides): Functional dependencies with 
two-element left-hand sides $\{A, B\} \to \{C\}$ cannot be decomposed. They are 
represented in the triangular (2D shape) on the node relating their right side 
to the corner. 
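
To see how many positions such a diagram has to offer, the following sketch enumerates all non-trivial singleton dependencies with non-empty left sides and groups them by dimension. The function name `slots` is hypothetical and attributes are single characters for brevity.

```python
from itertools import combinations


def slots(attributes):
    """Group all non-trivial singleton dependencies X -> a (X non-empty) by |X|."""
    attrs = sorted(attributes)
    by_dimension = {}
    for a in attrs:
        rest = [b for b in attrs if b != a]
        for size in range(1, len(rest) + 1):
            for lhs in combinations(rest, size):
                by_dimension.setdefault(size, []).append(("".join(lhs), a))
    return by_dimension


for dim, deps in sorted(slots("ABC").items()):
    print(dim, deps)
```

For three attributes this gives six one-dimensional and three two-dimensional dependencies, matching the edge endpoints and the inner nodes of the triangle.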
Fig. 1. Triangular representation of sets of functional dependencies for the ternary case 



We may represent also candidates for excluded functional dependencies by 

crossed circles for the case that we know that the corresponding functional 
dependency is not valid in applications or by 

small circles for the case that we do not know whether the functional depen- 
dency holds or does not hold. 

We use now the following notations in the figures: 

Basic functional dependencies are denoted by filled circles. 

Implied functional dependencies are denoted by circles. 

Negated basic functional dependencies are either denoted by dots or by 
crossed filled circles. 

Implied negated functional dependencies are either denoted by dots or by 
crossed circles. 






Fig. 2. Examples of the triangular representation 



Figure 2 shows some examples of the triangular representation. The functional 
dependency $\{A\} \to \{B\}$ and the implied functional dependency $\{A, C\} \to \{B\}$ 
are shown in the left part. The functional dependencies $\{A\} \to \{B\}$, $\{B\} \to \{C\}$ 
and their implied functional dependencies are pictured in the middle triangle. 
The negated functional dependency $\{A, C\} \not\to \{B\}$ and the implied negated 
functional dependencies $\{A\} \not\to \{B\}$ and $\{C\} \not\to \{B\}$ are given in the right 
picture. 

As mentioned above, the triangular representation can be generalized to 
a higher number of attributes. Generalization can be performed in two directions: 
representation in a higher-dimensional space (3D in the case of 4 attributes, 
resulting in the tetrahedral representation) or constructing a planar (2D, quadratic) 
representation. We use the same approach as before in the case of three 
attributes. An example is displayed in Figure 3 (implication is explained later). In 
this paper we concentrate on the 2D representation. 



Fig. 3. The tetrahedral and quadratic representations of the set generated by 
$\{B \to C,\ B \to D,\ B \to A,\ AD \to B,\ AC \to B\}$ 



This representation can be generalized to the case of 5 attributes. 



4 Implication Systems for the Graphical Representations 



Excluded functional constraints and functional dependencies are axiomatizable 
by the following formal system [12]. 

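Axioms 

$$XY \to Y$$ 

Rules 

$$(1)\ \frac{X \to Y}{XVW \to YV} \qquad (2)\ \frac{X \to Y,\ Y \to Z}{X \to Z} \qquad (3)\ \frac{X \to Y,\ X \not\to Z}{Y \not\to Z}$$ 

$$(4)\ \frac{X \not\to Y}{X \not\to YZ} \qquad (5)\ \frac{XZ \not\to YZ}{XZ \not\to Y} \qquad (6)\ \frac{X \to Z,\ X \not\to YZ}{X \not\to Y} \qquad (7)\ \frac{Y \to Z,\ X \not\to Z}{X \not\to Y}$$ 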
The universe of the extended Armstrong implication system is $\mathbb{D} \cup \mathbb{E}$ (see 
Section 2.1) while our graphical and spreadsheet representations deal with sets 
of constraints over $\mathbb{D}_c^+ \cup \mathbb{E}_c^+$.² However, the axiom and rules of the extended 
Armstrong implication system do not correspond to this restriction. It will be 
shown that an equivalent implication system can be constructed if these 
restrictions are applied to the universe of constraints. We develop a new implication 
system for graphical reasoning: 



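$$(S)\ \frac{Y \to B}{YC \to B} \qquad (T)\ \frac{Y \to A,\ YA \to B}{Y \to B} \qquad (P)\ \frac{YC \not\to B}{Y \not\to B}$$ 

$$(Q)\ \frac{Y \to A,\ Y \not\to B}{YA \not\to B} \qquad (R)\ \frac{YA \to B,\ Y \not\to B}{Y \not\to A} \qquad (\square)\ \neg(Y \to B,\ Y \not\to B)$$ 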



The rules presented here can directly be applied for deducing consequences of a 
set of constraints given in terms of the graphical or spreadsheet representation. 
We use the following two implication systems: 



- the ST implication system over $\mathbb{D}_c^+$ with rules (S) and (T) and no axioms, 

- the PQRST implication system over $\mathbb{D}_c^+ \cup \mathbb{E}_c^+$ with all the presented rules and 
the symbolic axiom ($\square$), which is used for indicating contradiction. 

These systems are sound and complete for deducing non-trivial, singleton con- 
straints. 
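
The two positive rules can be sketched in Python as operations on a set of pairs (left side, right-side attribute). This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, attributes are single characters, and the sketch saturates with rule (S) before rule (T).

```python
def apply_s(fds, attributes):
    """Rule (S): from Y -> B derive YC -> B for every C outside Y and different from B."""
    result = set(fds)
    work = list(fds)
    while work:
        lhs, b = work.pop()
        for c in set(attributes) - lhs - {b}:
            new = (lhs | {c}, b)
            if new not in result:
                result.add(new)
                work.append(new)
    return result


def apply_t(fds):
    """Rule (T): from Y -> A and YA -> B derive Y -> B, until nothing changes."""
    result = set(fds)
    changed = True
    while changed:
        changed = False
        for lhs, b in list(result):
            for a in lhs:
                y = lhs - {a}
                if (y, a) in result and (y, b) not in result:
                    result.add((y, b))
                    changed = True
    return result


def st_closure(fds, attributes):
    """Saturate with (S) first and with (T) afterwards."""
    return apply_t(apply_s(fds, attributes))


given = {(frozenset("A"), "B"), (frozenset("B"), "C")}
for lhs, b in sorted(st_closure(given, "ABC"), key=lambda fd: (fd[1], sorted(fd[0]))):
    print("".join(sorted(lhs)), "->", b)
```

For the middle triangle of Figure 2 ($A \to B$, $B \to C$) the sketch adds $AC \to B$ and $AB \to C$ by (S) and then $A \to C$ by (T).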

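Theorem 1 The ST system is sound and complete over $\mathbb{D}_c^+$, i.e. $\mathcal{F} \vdash_{ST} \delta \iff \mathcal{F} \models \delta$ 
for each finite subset $\mathcal{F}$ of $\mathbb{D}_c^+$ and $\delta \in \mathbb{D}_c^+$. 

Theorem 2 Let $\mathcal{F}$ be a finite subset of $\mathbb{D}_c^+ \cup \mathbb{E}_c^+$ and $\delta \in \mathbb{D}_c^+ \cup \mathbb{E}_c^+$. 

The PQRST system without ($\square$) is sound over $\mathbb{D}_c^+ \cup \mathbb{E}_c^+$ and complete with 
the restriction that $\mathcal{F}$ cannot be contradictory, i.e. $\mathcal{F} \vdash_{PQRST} \delta \iff \mathcal{F} \models \delta$ 
for each non-contradictory $\mathcal{F}$. Moreover, $\neg(\square)$ can be derived iff $\mathcal{F}$ is 
contradictory. 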

The implication systems introduced above have the advantage of the exis- 
tence of a specific order of rules which provides a complete algorithmic method 
for getting all the implied functional dependencies and excluded functional con- 
straints starting with an initial set, allowing one to determine the possible types 
of relationships the initial set of dependencies defines. 

Theorem 3³ 

1. Let $\mathcal{F}$ and $\mathcal{S}$ be finite subsets of $\mathbb{D}_c^+$. If $\mathcal{F} \vdash_{ST} \mathcal{S}$ then all elements of $\mathcal{S}$ can 
be deduced starting with $\mathcal{F}$ by using the rules (S) and (T) in such a way that no 
application of (T) precedes any application of (S). 

2. If $\mathcal{F}$ and $\mathcal{S}$ are finite subsets of $\mathbb{D}_c^+ \cup \mathbb{E}_c^+$ and $\mathcal{F} \vdash_{PQRST} \mathcal{S}$ then all elements 
of $\mathcal{S}$ can be deduced starting with $\mathcal{F}$ by using the rules (S), (T), (R), (P) 
and (Q) in such a way that no application of (T) precedes any application of (S), 
no application of (R) precedes any application of (T) and no application of 
(P) or (Q) precedes any application of (R). The order of (P) and (Q) is arbitrary. 
Furthermore, (R) needs to be applied at most once if $|\mathcal{S}| = 1$. 

² For example, $X \to AB$ is represented as $X \to A$ and $X \to B$. Excluded functional 
constraints with more than one attribute on their right-hand sides cannot be 
eliminated this way. However, omitting these can also be achieved (see [5]). 

³ Proofs of the theorems are given in [5]. 

5 Graphical Reasoning 

Rules of the PQRST implication system support graphical reasoning. We will 
discuss first the case of n = 3. 




Fig. 4. Graphical versions of rules (S), (T) and (P), (Q), (R) 



Graphical versions of rules are shown in Figure 4 for the triangular 
representation (case $Y = \{C\}$). The small black arrows indicate support (necessary 
context) while the large grey arrows show the implication effects. Rule (S) is a 
simple extension rule and rule (T) can be called the “rotation rule” or “reduction 
rule”. We may call the left-hand side of a functional dependency its determinant 
and the right-hand side its determinate. Rule (S) can be used to extend the 
determinant of a dependency, resulting in another dependency with one dimension 
higher, while rule (T) is used for rotation, that is, to replace the determinate of 
a functional dependency by the support of another functional dependency with 
one dimension higher (the small black arrow at $B$ indicates support of $AC \to B$). 
Another possible way to interpret rule (T) is for reduction of the determinant 
of a higher-dimensional dependency by omitting an attribute if a dependency 
holds among the attributes of the determinant. 

For excluded functional constraints, rule (Q) acts as the extension rule (needs 
support of a positive constraint, i.e. a functional dependency) and (R) as the 
rotation rule (needs a positive support too). These two rules can also be viewed as 
negations of rule (T). Rule (P) is the reduction rule for excluded functional 
constraints, with the opposite effect of rule (Q) (but without the need of support). 
Rule (Q) is also viewed as the negation of rule (S). 

These graphical rules can be generalized to higher dimensional cases, where 
the number of attributes is more than 3. Figure 5 shows the patterns of rules (S) 
and (T) for the case n = 4. We use two or three patterns for a single case since 
we need a way to survey constraint derivation by (not completely symmetric) 
2D diagrams. We differentiate between the case that the rules (S) and (T) use 
functional dependencies consisting of singleton left sides and the case that the 
minimum dimension of functional dependencies is two. 



[Figure 5: panels for rules (S) and (T), for the cases |Y| = 1 and |Y| = 2] 




Fig. 5. Patterns of graphical rules (S) and (T) for the quadratic representation 



Theorem 3 in Section 4 shows that for positive dependencies, using (S) first 
as many times as possible and using (T) as many times as possible afterwards is 
a complete method for obtaining all non-trivial positive consequences of a given 
set of constraints. We may call it the ST algorithm⁴. This can be extended for the 
case with excluded functional constraints. We now present it as an algorithm for 
FD derivation based on the graphical representation: 



The STRPQ Algorithm for Sets of Both Positive and Negative Constraints. 
Rules (P), (Q) and (R) can be applied as complements of rules (S) 
and (T), resulting in the following algorithm, called the STRPQ algorithm (based on 
part 2 of Theorem 3): 



⁴ With some modifications, this algorithm has been used for generating and counting 
all sets of functional dependencies (see Section 2.3) with a PROLOG program. 

1. Starting with the given initial set of non-trivial, singleton functional 
dependencies and excluded functional constraints as input, 

2. extend the determinants of each dependency using rule (S) as many times 
as possible, then 

3. apply rule (T) until no changes occur, 

4. apply rule (R) until no changes occur, 

5. reduce and extend the determinants of excluded constraints using rules (P) 
and (Q) as many times as possible. 

6. Output the generated set. 
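
The six steps can be sketched in Python as follows. This is a minimal illustration and not the authors' PROLOG program: the function name `strpq` and the pair encoding (left side, right-side attribute) are hypothetical, rule (P) is not applied down to empty left sides (an assumption matching the case where constant attributes are disallowed), and a contradiction is reported when the same dependency ends up both valid and excluded.

```python
def strpq(attributes, positive, negative):
    """Saturate positive and negative singleton constraints in the order S, T, R, P/Q."""
    attrs = frozenset(attributes)
    pos = {(frozenset(lhs), b) for lhs, b in positive}
    neg = {(frozenset(lhs), b) for lhs, b in negative}

    # Step 2, rule (S): Y -> B gives YC -> B.
    work = list(pos)
    while work:
        lhs, b = work.pop()
        for c in attrs - lhs - {b}:
            new = (lhs | {c}, b)
            if new not in pos:
                pos.add(new)
                work.append(new)

    # Step 3, rule (T): Y -> A and YA -> B give Y -> B.
    changed = True
    while changed:
        changed = False
        for lhs, b in list(pos):
            for a in lhs:
                y = lhs - {a}
                if (y, a) in pos and (y, b) not in pos:
                    pos.add((y, b))
                    changed = True

    # Step 4, rule (R): YA -> B and Y -/-> B give Y -/-> A.
    changed = True
    while changed:
        changed = False
        for lhs, b in pos:
            for a in lhs:
                y = lhs - {a}
                if (y, b) in neg and (y, a) not in neg:
                    neg.add((y, a))
                    changed = True

    # Step 5, rules (P) and (Q) in arbitrary order.
    work = list(neg)
    while work:
        lhs, b = work.pop()
        derived = [(lhs - {c}, b) for c in lhs if len(lhs) > 1]              # (P)
        derived += [(lhs | {a}, b) for y, a in pos if y == lhs and a != b]   # (Q)
        for new in derived:
            if new not in neg:
                neg.add(new)
                work.append(new)

    if pos & neg:
        raise ValueError("contradictory constraint set")
    return pos, neg  # Step 6


pos, neg = strpq("ABC", [], [("AC", "B")])
print(sorted(("".join(sorted(lhs)), b) for lhs, b in neg))
```

The usage line reproduces the right picture of Figure 2: from the excluded constraint $\{A, C\} \not\to \{B\}$ the sketch derives $\{A\} \not\to \{B\}$ and $\{C\} \not\to \{B\}$ by rule (P).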

The algorithm just presented can be used for reasoning on sets of functional 
constraints, especially in terms of the graphical representations. The structure of 
the generalized triangular representations (2D-triangular, 3D-tetrahedral, etc.) 
may also be used for designing a data structure representing sets of functional 
constraints for the algorithms. 

6 Applying Graphical Reasoning to Sets 
of Functional Dependencies 

Let us consider a more complex example discussed in [12]. We are given a part of 
the Berlin airport management database for scheduling flights and pilots at one 
of its airports. Flights depart at one departure time and to one destination. A 
flight can get different pilots and can be assigned to different gates each day of a 
week. In the given application we observe the following functional dependencies 
for the attributes Flight#, (Chief)Pilot#, Gate#, Day, Hour, Destination: 

{ Flight#, Day } → { Pilot#, Gate#, Hour } 

{ Flight# } → { Destination, Hour } 

{ Day, Hour, Gate# } → { Flight# } 

{ Pilot#, Day, Hour } → { Flight# } 
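
As a quick plausibility check of these four dependencies, the following sketch computes attribute closures and the candidate keys by brute force. It is an illustration only: the helper names are hypothetical, and the attribute written as Pilot in the first dependency is assumed to be Pilot#.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Minimal attribute sets whose closure is the whole attribute set."""
    attributes = set(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            combo = set(combo)
            if any(key <= combo for key in keys):
                continue  # a proper subset is already a key
            if closure(combo, fds) == attributes:
                keys.append(combo)
    return keys


ATTRIBUTES = {"Flight#", "Pilot#", "Gate#", "Day", "Hour", "Destination"}
FDS = [
    ({"Flight#", "Day"}, {"Pilot#", "Gate#", "Hour"}),  # Pilot# assumed for "Pilot"
    ({"Flight#"}, {"Destination", "Hour"}),
    ({"Day", "Hour", "Gate#"}, {"Flight#"}),
    ({"Pilot#", "Day", "Hour"}, {"Flight#"}),
]

for key in candidate_keys(ATTRIBUTES, FDS):
    print(sorted(key))
```

Under these assumptions the left sides { Flight#, Day }, { Day, Hour, Gate# } and { Pilot#, Day, Hour } each determine all six attributes.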

As noticed in [12], we can model this database in five very different ways. Figure 
6 displays one of the solutions. All types in Figure 6 are in the third normal form. 
Additionally, the following constraint is valid for the solution in Figure 6: 
flies: { GateSchedule.Time, Pilot.# } → { GateSchedule }. 

The two schemata additionally have transitive path constraints, e.g.: 

flies: { GateSchedule.Time, Day, Flight.# } → { GateSchedule.Gate.# }. 
But the types are still in third normal form since for each functional dependency 
$X \to Y$ defined for the types either $X$ is a key or $Y$ is a part of a key. 

The reason for the existence of rather complex constraints is the twofold usage 
of Hour. For instance, in our solution we find the equality constraint: 
flies.Flight.Hour = flies.GateSchedule.Time.Hour. 

We must know now whether the set of functional dependencies is complete. 
The combinatorial complexity of brute-force consideration of dependency sets is 
overwhelming. 

Let us now apply our theoretical findings to cope with the complexity and 
to reason on the sets of functional dependencies. We may use the following al- 
gorithm: 
Fig. 6. An extended ER Schema for the airline database with transitive path con- 
straints 

1. Check whether attributes that are not used in any left side of a functional 
dependency are really dangling. This is done by using the STRPQ 
algorithm with each of the attributes and the rest of the other attributes. We may 
strip out dangling attributes without losing reasoning power. In the example we 
strip out Destination. 

2. Combine attributes into groups such that they appear together in left sides 
of functional dependencies. Consider first the relations among those attribute 
groups using the STRPQ algorithm. In our example we consider the groups 
(A) Day, Hour, (B) Flight#, Day, (C) (Chief)Pilot#, and (D) Gate#. The 
result is shown in Figure 3. 

3. Now recursively apply the STRPQ algorithm to decompositions of attribute 
groups. 
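
The first part of step 1, finding the attributes that occur in no left side, is purely syntactic and can be sketched as below. The function name is hypothetical, the attribute written as Pilot in the first dependency is assumed to be Pilot#, and the subsequent check with the STRPQ algorithm is not covered by this sketch.

```python
def never_on_left(attributes, fds):
    """Attributes that do not occur in the left side of any given dependency."""
    used = set()
    for lhs, _rhs in fds:
        used |= set(lhs)
    return set(attributes) - used


ATTRIBUTES = {"Flight#", "Pilot#", "Gate#", "Day", "Hour", "Destination"}
FDS = [
    ({"Flight#", "Day"}, {"Pilot#", "Gate#", "Hour"}),  # Pilot# assumed for "Pilot"
    ({"Flight#"}, {"Destination", "Hour"}),
    ({"Day", "Hour", "Gate#"}, {"Flight#"}),
    ({"Pilot#", "Day", "Hour"}, {"Flight#"}),
]

print(never_on_left(ATTRIBUTES, FDS))
```

For the airport example the only such attribute is Destination, the attribute stripped out in step 1.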

The example shows how graphical reasoning can be directly applied to larger 
sets of attributes which have complex relations among them and can be expressed 
through functional dependencies. 

7 Conclusion 

The problem whether there exists a simple and sophisticated representation of 
sets of constraints that supports reasoning on constraints is solved in this paper 
by introducing a more surveyable means for the representation of constraint sets: 
the graphical representation. It requires a different implication system than the 
classical Armstrong system. We, thus, introduced another system and could show 
(Theorems 1 and 2) its soundness and completeness. 

This system has another useful property (Theorem 3): Constraint derivation 
may be ordered on the basis of sequences of rules. Derivation rule application 
can be described using the regular expression (S)*; (T)*; (R)*; ((P)||(Q))*. This 
order of rule application is extremely useful whenever we want to know whether 
the set of generated functional constraints is full (closed), i.e., consists of all 
(positive or both positive and negative) dependencies that follow from the given 
initial system of functional constraints. Based on this, we were able to generate 
all possible sets of initial functional dependencies for n = 3, 4, 5. 
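
The ordering constraint can be checked mechanically on a recorded derivation. In the sketch below a derivation is encoded, purely as an assumption for illustration, as a string with one letter per rule application; the pattern mirrors the regular expression given above.

```python
import re

# One letter per rule application: all S, then all T, then all R, then P or Q.
RULE_ORDER = re.compile(r"S*T*R*[PQ]*")


def respects_order(trace):
    """True if the sequence of applied rules follows the order S, T, R, then P/Q."""
    return RULE_ORDER.fullmatch(trace) is not None


print(respects_order("SSTRPQQP"), respects_order("TS"))
```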

Graphical reasoning supports a simpler style of reasoning on constraint sets. 
Completeness and soundness of systems of functional dependencies and excluded 
functional dependencies become surveyable. Since database design approaches 
rely on completeness and soundness of constraint sets, our approach enables 
database designers to obtain better database design results. 
Abstract. Despite the existence of well-known software sizing methods such as 
the Function Point method, many developers still continue to use ad-hoc methods 
or so-called “expert” approaches. This is mainly due to the fact that the existing 
methods require much implementation information that is difficult to identify or 
estimate in the early stage of a software project. The accuracy of ad-hoc and 
“expert” methods is also problematic. The entity-relationship (ER) model is 
widely used in conceptual modeling (requirements analysis) for data-intensive 
systems. From our observation, the characteristic of a data-intensive system, 
and therefore the source code of its software, is well characterized by the ER 
diagram that models its data. Based on this observation, this paper proposes a 
method for building a software size model from an extended ER diagram through 
the use of regression models. We have collected some real data from the 
industry to do a preliminary validation of the proposed method. The result of the 
validation is very encouraging. As software sizing is an important key to soft- 
ware cost estimation and therefore vital to the industry for managing their soft- 
ware projects, we hope that the research and industry communities can further 
validate the proposed method. 



1 Introduction 

Estimating project size is a crucial task in any software project. Overestimates may 
lead to the abortion of projects or loss of projects to competitors. Underestimates 
pressurize project teams and may also adversely affect the quality of projects. 

Despite the existence of well-known software sizing methods such as the Function 
Point method [1], [10] and the more recent Full Function Point method [7], many 
practitioners and project managers continue to produce estimates based on ad-hoc or 
so-called “expert” approaches [2], [8], [15]. This is mainly due to the fact that 
existing sizing methods require much implementation information that is not available in 
the earlier stage of a software project. However, the accuracy of ad-hoc and expert 
approaches is also problematic, which results in questionable project budgets and 
schedules. 

The entity-relationship (ER) model originally proposed by Chen [5] is generally 
regarded as the most widely used tool for the conceptual modeling of data-intensive 
systems. An ER model is constructed to depict the ideal organization of data, inde- 
pendent of the physical organization of the data and where and how data are used. 

Indeed, many of the requirements of a data-intensive system are reflected in the ER
model that depicts its data conceptually. This paper proposes a novel method for
building software size models that estimate the size of the source code of a
data-intensive system based on its extended ER diagram. It also discusses our
effort to validate the proposed method by building software size models for
data-intensive systems written in Visual Basic and Java.

The paper is organized as follows. Section 2 gives the background information of 
the paper. Section 3 discusses our observation and its rationale. Section 4 presents the 
proposed method for building software size models to estimate the sizes of source 
codes for data-intensive systems. Section 5 discusses our preliminary validation of 
the proposed method. Section 6 concludes the paper and compares the proposed 
method with related methods. 

2 Background 

The entity-relationship (ER) model was originally proposed by Chen [5] for data
modeling, and it has subsequently been extended by Chen and others [17]. In this paper,
we refer to the extended ER model that has the same set of concepts as the class
diagram in terms of data modeling. In summary, the extended ER model uses the
concepts of entity, attribute and relationship to model the conceptual data for a problem.
Each entity has a set of attributes, each of which is a property or characteristic
of the entity that is relevant to the problem. Relationships can be classified into three types:
association, aggregation and generalization.

There are four main stages in developing software systems: requirements capture, 
requirements analysis, design and implementation. The requirements are studied and 
specified in the requirements capture stage. They are realized conceptually in the 
requirements analysis. The design for implementing the requirements, taking the target
environment into consideration, is constructed in the design stage. In the
implementation stage, the design is coded in the target programming language and
the resulting code is tested to ensure its correctness.

Though UML (Unified Modeling Language) has gained popularity as a standard
software modeling language, many data-intensive systems are still developed in
the industry through some form of data-oriented approach. In such an approach, some
form of extended entity-relationship (ER) model is constructed to model the data
conceptually in the requirements capture and analysis stages, and the subsequent
design and implementation activities are very much based on the extended ER model.
For projects that use UML, a class diagram is usually constructed in the requirements 
analysis stage. Indeed, for a data intensive system, the class diagram constructed can 
be viewed as an extended ER model with the extension of behavioral properties 
(processing). Therefore, in the early stage of software development, some form of 
extended ER model is more readily available than information such as external in- 
puts, outputs and inquiries, and external logical files and external interface files that 
are required for the computation of function points. 

3 Our Observation 

Data-intensive systems constitute one of the largest domains in software. These
systems usually maintain a large amount of structured data in a database built using a
database management system (DBMS). They provide operational, control and
management support to end-users by referencing and analyzing these data. The
support is usually accomplished by accepting inputs from users, processing the inputs,
updating databases, printing reports, and answering inquiries to help users in the
management and decision-making processes.

The proposed method for building software size models for data-intensive systems
is based on our observation of these systems. Next, we discuss the observation
and its rationale.

The Observation: Under the same development environment (that is, a particular 
programming language and tool used), the size of source codes for a data-intensive 
system usually depends on the extended ER diagram that models its data. 

Rationale: The constituents of the data-intensive system can be classified into the 
following: 

1) Support business operations through accepting inputs to maintain entities
modeled in the ER diagram.

2) Support decision making processes through producing outputs from information 
possessed by entities modeled in the ER diagram. 

3) Implement business logic to support the business operation and control. 

4) Reference entities modeled in the ER diagram to support the first three
constituents.

Since the first two and the last constituents are based on the ER diagram, they
depend on the ER diagram. At first glance, it seems that the third constituent
may not depend on the ER diagram. However, since a data-intensive system usually
does not perform complex computation within the source code (any complex
computation is usually achieved by calling pre-developed functions), business logic in
the source code mainly consists of navigation between entities via relationship types
together with simple computation. For example, take the business rule that no further
orders will be processed if a customer has two overdue invoices. The source code
implementing this rule retrieves the overdue invoices in the Invoice entity type
for the customer in the Customer entity type via the relationship type that associates a
customer with its invoices. There is no complex computation involved. Therefore, it
is reasonable to assume that the implementation of business logic in a
data-intensive system usually also depends on the ER diagram. This completes the
rationale of the observation. □

4 The Proposed Software Sizing 

From the observation discussed in the previous section, the size of the source code for
a data-intensive system usually depends, and depends only, on the structure and size of
the extended ER diagram that models its data. Furthermore, ER diagrams have been
widely and successfully used in the requirements modeling and analysis stages. It is
therefore suitable to base the estimation of the size of source code for a
data-intensive system on its extended ER diagram, and we propose a novel method for
building software size models from extended ER diagrams. This section discusses the
method.

The proposed method builds software size models through well-known linear
regression models. For a data-intensive system, the variables that sufficiently
characterize the extended ER diagram for the system form the independent variables. The
dependent variable is the size of its source code in thousand lines of code (KLOC).
Note that in this case, the extended ER diagram is implemented, and is only
implemented, by the system. That is, the extended ER diagram and the system must
have a one-to-one correspondence. As such, any source code that references
or updates the database that is designed from the extended ER diagram must be
included as part of the source code.

In the proposed approach, a separate software size model should be built for each 
different development environment (that is, each programming language and tool 
used). For example, different software size models should be built for systems written 
in Visual Basic with VB Script and SQL, and systems written in Java with JSP, Java 
Script and SQL. In the most precise case, the independent variables that characterize
the extended ER diagram comprise the following:

1) Total number of entity types. 

2) Total number of attributes. 

3) Total numbers of association types classified based on their degrees and
multiplicities: Usually, association types whose degree is below an upper limit can
be counted separately for each degree, and the remaining ones can be lumped into
one class. Multiplicities can be classified into zero-or-one, one and many. More
precise classifications can also be tried.

4) Total numbers of aggregation types classified based on their degrees and
multiplicities: Same as the association types.

5) Total numbers of generalization types classified based on the number of
sub-classes: Usually, generalization types whose number of sub-classes is below an
upper limit can be counted separately for each number, and the remaining ones can
be lumped into one class.

However, we do not propose to build software size models based on a fixed set of
independent variables. The choice depends on the kind of ER diagrams used in the
organizations for which we develop the software size model. Note that the above-mentioned
association refers to association that is not aggregation. Relationship
types are separated into associations, aggregations and generalizations because of the
differences in their semantics. These differences may result in differences in navigation and
updating needs in the database.

We propose that the independent variables should be defined according to the type
of ER diagram constructed during the requirements modeling and analysis stages, so that
the data required for software sizing is readily available in the early stage of
requirements analysis. From our experience in building the proposed software size
models using data collected from the industry, hardly any relationship type is ternary
or of higher order, and most ER diagrams do not classify their relationship types
into association, aggregation and generalization. The precision of the independent
variables depends on the types of extended ER diagram constructed in the
requirements modeling and analysis stages in the organization. However, a larger set of
independent variables will require a larger set of data for building and evaluating the
model.

The steps for building the proposed software size models are as follows:

1) Independent variables identification: Based on the type of data model (a class or 
ER diagram) constructed during requirements modeling and analysis, we identify 
a set of independent variables that sufficiently characterize the diagram. 

2) Data collection: Collect the ER diagrams and source code sizes (in KLOC) of a
sufficient number of data-intensive systems. A larger set of independent variables
will require a larger set of data. Many free tools are available for the automated
extraction of source code size.

3) Model building and evaluation: There are quite a number of commonly used
regression models [16]. Both linear and non-linear models can be considered. The
size of source code (in KLOC) and the independent variables identified in the first
step form the dependent and the independent variables respectively for the model.
Statistical packages (e.g., SAS) should be used for the model building. Ideally, we
should have separate data sets for model building and evaluation. However, if
the data is limited, the same data set may also be used for model building and
evaluation. Let n be the number of data points and k be the number of independent
variables. Let $y_i$ and $\hat{y}_i$ be the actual and the estimated values respectively
of project i. Let $\bar{y}$ be the mean of all $y_i$. The goodness of the model can be
evaluated by examining the following parameters:

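• Magnitude of relative error, MRE, and mean magnitude of relative error,
MMRE: They are defined as follows:

$$\mathrm{MRE}_i = \frac{|y_i - \hat{y}_i|}{y_i} \qquad (1)$$

$$\mathrm{MMRE} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MRE}_i \qquad (2)$$

If the MMRE is small, then we have a good set of predictions. A usual criterion
for accepting a model as good is that the model has an MMRE < 0.25.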

• Prediction at level l, Pred(l), where l is a percentage: It is defined as the
number of cases in which the estimates are within l of the actual values,
divided by the total number of cases. A standard criterion for
considering a model as acceptable is Pred(0.25) > 0.75.

• Multiple coefficient of determination, $R^2$, and adjusted multiple coefficient of
determination, $R_a^2$: These are usual measures in regression analysis,
denoting the percentage of variance accounted for by the independent variables
used in the regression equations. They are computed as follows:

$$R^2 = \frac{\text{Explained variability}}{\text{Total variability}} = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}} \qquad (3)$$

$$R_a^2 = 1 - \frac{n - 1}{n - (k + 1)} \cdot \frac{SSE}{SS_{yy}} \qquad (4)$$

where the sum of squared errors $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ and
$SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$.

In general, the larger the values of $R^2$ and $R_a^2$, the better the fit to the data.
$R^2 = 1$ implies a perfect fit, with the model passing through every data point. However,
$R^2$ can only be used as a measure to assess the usefulness of the model if the
number of data points is substantially more than the number of independent
variables.
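
The three measures above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy data are made up for the example, and treating "within the limit" as MRE less than or equal to the level is an assumption about the convention rather than something stated in the text.

```python
def mre(actual, estimate):
    """Magnitude of relative error for a single project."""
    return abs(actual - estimate) / actual


def mmre(actuals, estimates):
    """Mean magnitude of relative error over all projects."""
    return sum(mre(a, e) for a, e in zip(actuals, estimates)) / len(actuals)


def pred(actuals, estimates, level=0.25):
    """Fraction of projects whose MRE is at most `level` (assumed convention)."""
    hits = sum(1 for a, e in zip(actuals, estimates) if mre(a, e) <= level)
    return hits / len(actuals)


def r_squared(actuals, estimates):
    """R^2 = 1 - SSE / SS_yy."""
    mean = sum(actuals) / len(actuals)
    sse = sum((a - e) ** 2 for a, e in zip(actuals, estimates))
    ss_yy = sum((a - mean) ** 2 for a in actuals)
    return 1.0 - sse / ss_yy


def adjusted_r_squared(actuals, estimates, k):
    """Adjusted R^2 for a model with k independent variables."""
    n = len(actuals)
    return 1.0 - (n - 1) / (n - (k + 1)) * (1.0 - r_squared(actuals, estimates))


# Made-up sizes in KLOC, for illustration only (not data from the paper).
toy_actual = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
toy_estimate = [11.0, 18.0, 33.0, 40.0, 65.0, 57.0]

print("MMRE      :", mmre(toy_actual, toy_estimate))
print("Pred(0.25):", pred(toy_actual, toy_estimate))
print("R^2       :", r_squared(toy_actual, toy_estimate))
print("R_a^2     :", adjusted_r_squared(toy_actual, toy_estimate, k=1))
```

Because the adjusted coefficient divides by n - (k + 1), it is only defined when there are more data points than fitted coefficients, which is in line with the caution above about small data sets.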

If the same data set is used for both model building and evaluation, we can further 
examine the following parameters to evaluate the model goodness: 

• Relative root mean squared error, RMS, is defined as follows [6]:

$$\mathrm{RMS} = \frac{\sqrt{SSE / (n - (k + 1))}}{\bar{y}} \qquad (5)$$

where $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. A model is considered acceptable if RMS < 0.25.

• Prediction sum of squares, PRESS [16]: PRESS is a measure of how well the
fitted values for a subset model can predict the observed responses
$y_i$. The error sum of squares, $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, is also such a
measure. The PRESS measure differs from SSE in that each fitted value for the PRESS is
obtained by deleting the i-th case from the data set, estimating the regression
function for the subset model from the remaining n - 1 cases, and then using
the fitted regression function to obtain the predicted value $\hat{y}_{i(i)}$ for the i-th
case. That is, it is defined as follows:

$$\mathrm{PRESS} = \sum_{i=1}^{n} (y_i - \hat{y}_{i(i)})^2 \qquad (6)$$

Models with smaller PRESS values are considered good candidate models. The
PRESS value is always larger than SSE because the regression fit used for the i-th
case excludes that case, whereas the fit used for SSE includes it. A smaller PRESS
value supports the validity of the model built.
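
To make the leave-one-out definition of PRESS concrete, the following sketch fits an ordinary least-squares model with an intercept and compares SSE with PRESS. It is a minimal illustration, not the procedure used in the paper (which recommends a statistical package): the small solver, the function names and the toy data are all assumptions made for the example.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit(rows, ys):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    xs = [[1.0] + list(r) for r in rows]
    p = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(p)]
    return solve(xtx, xty)


def predict(beta, row):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


def sse(rows, ys):
    """Error sum of squares: every case is fitted with all cases included."""
    beta = fit(rows, ys)
    return sum((y - predict(beta, r)) ** 2 for r, y in zip(rows, ys))


def press(rows, ys):
    """Prediction sum of squares: case i is predicted from a fit without case i."""
    total = 0.0
    for i in range(len(ys)):
        beta = fit(rows[:i] + rows[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - predict(beta, rows[i])) ** 2
    return total


# Made-up data with two independent variables, for illustration only.
toy_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 7), (6, 5)]
toy_sizes = [4.2, 5.3, 9.1, 10.4, 14.9, 15.2]

print("SSE  :", sse(toy_rows, toy_sizes))
print("PRESS:", press(toy_rows, toy_sizes))
```

Refitting the model n times is affordable for data sets of the size discussed here; for larger ones, statistical packages compute PRESS without explicit refitting.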



5 Preliminary Validation 

As ER diagrams constructed in most projects in the industry do not classify
relationship types into associations, aggregations and generalizations, a complete validation
of the proposed method is not possible. We spent much effort persuading
organizations in the industry to supply us with their project data for the validation of the
proposed software sizing method, and the whole validation took about one and a
half years. This section discusses our validation.

Due to the above-mentioned constraint, the independent variables for
characterizing an ER diagram in our validation are simplified as follows:

1) Number of entity types (E) 

2) Number of attributes (A) 

3) Number of relationship types (R)

These variables provide a reasonable and concise characterization of the ER dia- 
gram. 
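
As a rough illustration of how these three counts could be read off a diagram, the sketch below uses a small hypothetical in-memory representation of an ER diagram. The class names, the example entities and the choice to count only attributes attached to entity types are assumptions made for this example; the paper does not prescribe a representation.

```python
from dataclasses import dataclass, field


@dataclass
class EntityType:
    name: str
    attributes: list = field(default_factory=list)


@dataclass
class RelationshipType:
    name: str
    participants: tuple  # names of the entity types it connects


@dataclass
class ERDiagram:
    entities: list
    relationships: list

    def size_variables(self):
        """Return the counts E, R and A used as independent variables."""
        return {
            "E": len(self.entities),
            "R": len(self.relationships),
            # Assumption: only attributes of entity types are counted.
            "A": sum(len(e.attributes) for e in self.entities),
        }


# Hypothetical diagram for illustration only.
diagram = ERDiagram(
    entities=[
        EntityType("Customer", ["id", "name", "address"]),
        EntityType("Invoice", ["number", "date", "amount", "due_date"]),
        EntityType("Order", ["number", "date"]),
    ],
    relationships=[
        RelationshipType("billed_to", ("Invoice", "Customer")),
        RelationshipType("placed_by", ("Order", "Customer")),
    ],
)

print(diagram.size_variables())
```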

Our validation is based on the following linear regression model [14]:

$$Size = \beta_0 + \beta_1 E + \beta_2 R + \beta_3 A \qquad (7)$$

where Size is the total KLOC (thousand lines of code) of all the source code that is
developed based on the ER diagram and each $\beta_i$ ($0 \le i \le 3$) is a coefficient to be
determined.

5.1 The Dataset 

We collected three datasets from multiple organizations in the industry, including
software houses and end-users such as public organizations and insurance companies.
These projects cover a wide range of application domains including freight
management, administrative and financial systems. The first dataset comprises 14 projects
that were developed using Visual Basic with VB Script and SQL. The second dataset
comprises 10 projects that were developed using Java with JSP, Java Script and SQL.
Tables 1 and 2 show the details of these two datasets. The first and second datasets are
used for building the software size models for the respective development
environments. The third dataset comprises 8 projects developed using the same Visual
Basic development environment as the first dataset. Table 3 shows the details of the
third dataset.

5.2 The Resulting Models 

From the Visual Basic based project data set (Table 1), the resulting first order model
that we built for estimating the size of source code (in KLOC) developed using
Visual Basic with VB Script and SQL is as follows:

$$Size = 6.788 - 0.062E + 2.169R - 0.007A \qquad (8)$$

The adjusted multiple coefficient of determination $R_a^2$ for this model is 0.84. This
value of $R_a^2$ is reckoned as good.

From the Java based project data set (Table 2), the resulting first order model that
we built for estimating the size of source code (in KLOC) developed using Java with
JSP, Java Script and SQL is as follows:

$$Size = 4.678 + 1.218E + 0.028R + 0.023A \qquad (9)$$

The adjusted multiple coefficient of determination $R_a^2$ for this model is 0.99.
This value of $R_a^2$ is reckoned as very good.
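
Applying the two fitted models is a matter of plugging the counts E, R and A into equations (8) and (9). The short sketch below does this; the function names are made up for the example, and the coefficients are copied from the equations above. The two sample inputs are the first projects of Tables 3 and 4, for which the tables list estimated sizes of 30.547 and 14.292 KLOC.

```python
def estimate_vb_kloc(e, r, a):
    """Equation (8): Visual Basic with VB Script and SQL."""
    return 6.788 - 0.062 * e + 2.169 * r - 0.007 * a


def estimate_java_kloc(e, r, a):
    """Equation (9): Java with JSP, Java Script and SQL."""
    return 4.678 + 1.218 * e + 0.028 * r + 0.023 * a


# Project 1 of the VB evaluation dataset: E = 13, R = 12, A = 209.
print(round(estimate_vb_kloc(13, 12, 209), 3))

# Project 1 of the Java dataset: E = 7, R = 6, A = 40.
print(round(estimate_java_kloc(7, 6, 40), 3))
```

Note the negative coefficients for E and A in equation (8): with correlated independent variables, individual coefficients of a fitted model should not be read as the per-entity or per-attribute cost in isolation.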

Table 1. The VB based project dataset

Project No.   Actual Size (KLOC)     E     R      A
     1              42.1            19    25    314
     2              72              36    37    540
     3              29.52           16    16    441
     4              42.82           38    20    779
     5              16.73            6     5    112
     6             196.22           64    83   1524
     7              67.31           29    11    351
     8              52.76           31    24    330
     9              46.92           29    24    201
    10              37.54           27     8    216
    11              14.723           8     6    203
    12              24.667          12    10    203
    13              30.464          27    23    764
    14             109.659          96    63   1471


Table 2. The Java based project dataset

Project No.   Actual Size (KLOC)     E     R      A
     1              14.89            7     6     40
     2              12.62            6     5     37
     3              20.53           10    11     75
     4              23.68           16    19     61
     5              31.69           19    20     91
     6              20.45           11    11     56
     7              84.89           49    22    889
     8             104.539          64   167    765
     9             194.37          125   180   1400
    10              19.29            9     8     62



5.3 Model Evaluation 

For the first order model that we built for estimating the size of source code (in KLOC)
developed using Visual Basic with VB Script and SQL, we managed to collect a
separate data set for the evaluation of the model. Note that $R_a^2$ for this model has
already been computed during model building and is 0.84, which is reckoned as good.
MMRE and Pred(0.25) computed from the evaluation data set are 0.16 and 0.88
respectively. These values fall well within the acceptable level. The detailed results of
the evaluation are shown together with the evaluation dataset in Table 3. Therefore,
the evaluation results support the validity of the model built.
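
The summary figures for the Visual Basic model can be retraced from the rows of Table 3. The sketch below recomputes each MRE from the actual and estimated sizes, checks that it rounds to the printed value, and then derives MMRE and Pred(0.25) from the printed MRE column. Using the two-decimal printed values and counting a project as a hit when its MRE is at most 0.25 are assumptions about the convention; the variable names are made up for the example.

```python
# (actual KLOC, estimated KLOC, MRE as printed) for the eight projects in Table 3.
table3 = [
    (30.0, 30.547, 0.02),
    (13.3, 14.244, 0.07),
    (64.33, 62.409, 0.03),
    (59.02, 70.054, 0.19),
    (48.64, 59.139, 0.22),
    (117.03, 136.302, 0.16),
    (23.58, 16.162, 0.31),
    (43.95, 32.886, 0.25),
]

# MRE recomputed from the actual and estimated sizes.
recomputed = [abs(actual - est) / actual for actual, est, _ in table3]
printed = [m for _, _, m in table3]

# Summary measures derived from the printed (rounded) MRE column.
mmre = sum(printed) / len(printed)
pred_25 = sum(1 for m in printed if m <= 0.25) / len(printed)

for (actual, est, shown), value in zip(table3, recomputed):
    print(f"{actual:8.2f} {est:8.3f}  printed={shown:.2f}  recomputed={value:.4f}")
print("MMRE       =", round(mmre, 2))
print("Pred(0.25) =", pred_25)
```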

For the first order model that we built for estimating the size of source code (in
KLOC) developed using Java with JSP, Java Script and SQL, we did not manage to
collect a separate data set for the evaluation of the model. As such, we used the same
data set for the evaluation. Note that $R_a^2$ for this model has already been computed
during model building and is 0.99, which is reckoned as very good. MMRE,
Pred(0.25), SSE and PRESS computed from the same data set are 0.07, 1.00, 10.04 and
556.84 respectively. The detailed results of the evaluation are shown in Table 4. Both
MMRE and Pred(0.25) fall well within the acceptable level. Although there is a
difference between SSE and PRESS, the difference is not too substantial. Note that
RMS computed from SSE in this case is 0.02. If we replace SSE by PRESS in the
computation of RMS, then the value of RMS is 0.18. Both of these values fall well
below the acceptable level of 0.25. Therefore, the evaluation results support the validity
of the model built.
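
The two RMS values quoted above follow from equation (5) with n = 10 projects and k = 3 independent variables. As a worked check, the sketch below takes the actual sizes from Table 2 and the reported SSE and PRESS; the function and variable names are made up for the example.

```python
import math

# Actual sizes (KLOC) of the ten Java projects, as listed in Table 2.
java_sizes = [14.89, 12.62, 20.53, 23.68, 31.69,
              20.45, 84.89, 104.539, 194.37, 19.29]

n = len(java_sizes)   # number of data points
k = 3                 # independent variables E, R and A
mean_size = sum(java_sizes) / n


def relative_rms(error_sum):
    """Equation (5): sqrt(error_sum / (n - (k + 1))) divided by the mean size."""
    return math.sqrt(error_sum / (n - (k + 1))) / mean_size


print(round(relative_rms(10.04), 2))    # using the reported SSE
print(round(relative_rms(556.84), 2))   # using the reported PRESS instead
```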



Table 3. The VB based project evaluation dataset

Project No.   Actual Size (KLOC)     E     R      A   Estimated Size    MRE
     1              30              13    12    209       30.547        0.02
     2              13.3             5     4    130       14.244        0.07
     3              64.33           20    28    553       62.409        0.03
     4              59.02           25    32    656       70.054        0.19
     5              48.64           16    25    126       59.139        0.22
     6             117.03           96    68   1718      136.302        0.16
     7              23.58            6     5    157       16.162        0.31
     8              43.95           15    13    167       32.886        0.25

MMRE = 0.16, Pred(0.25) = 0.88







Table 4. The evaluation result of Java based model

Project No.   Actual Size (KLOC)   Estimated Size    MRE
     1              14.89              14.292        0.04
     2              12.62              12.977        0.03
     3              20.53              18.891        0.08
     4              23.68              26.101        0.10
     5              31.69              30.473        0.04
     6              20.45              19.672        0.04
     7              84.89              85.423        0.01
     8             104.539            104.901        0.00
     9             194.37             194.168        0.00
    10              17.29              17.29         0.00

MMRE = 0.03, Pred(0.25) = 1.00, SSE = 10.04, PRESS = 556.84



Though we managed to build only simplified software size models with the proposed
approach due to limitations in industry practice, the evaluation results
support the validity of the models built. As such, our empirical validation
supports the validity of the proposed method for building software size models.

6 Comparative Discussion

We have proposed a novel method for building software size models for data- 
intensive systems. Due to the lack of complete data for validating the proposed 
method from completed projects in the industry, we only managed to do a validation 
based on building and evaluating simplified proposed software size models. The 
statistical evaluation supports the validity of the proposed method. 

Due to the above-mentioned simplification and the limited size of our datasets, we do
not claim that the models built in this paper are ready for use. However, we
believe that our work shows the proposed method for software sizing to be
promising enough to be studied further. Software size estimation is key to project
estimation, which in turn is vital for project control and management [3], [4], [11].
Existing software size estimation methods have many problems. As the software
estimation community requires entirely new datasets for building and evaluating
software size models with the proposed method, we call for collaboration
between the industry and the research communities to validate the proposed method
further and more comprehensively. As the history of establishing the Function Point
method shows, without such an effort it is unlikely that a usable software
size model can be built.

As discussed in [15], most of the existing software sizing methods [9], [12], [13],
[18] require much implementation information that is not available and is difficult to
predict in the early stage of a software project. The information is not even available
after the requirements analysis stage; it only becomes available in the design or
implementation stage. For example, the Function Point method is based on external inputs,
outputs and inquiries, and external logical files and external interface files. Such
implementation details are not available even at the end of the requirements analysis stage.
ER diagrams have been well used in conceptual modeling for developing data-intensive
systems. Some proposals for software projects have also included ER diagrams as
part of the project requirements. As such, ER diagrams are in practice available at the
latest after the requirements analysis stage. Once the ER diagram is constructed, the
proposed software size model can be applied without much difficulty. Therefore, in the
worst case, we can apply the proposed approach after the requirements analysis stage.
Ideally, a brief extended ER model should be constructed during the project proposal
or planning stage, and the proposed software size model can then be applied to estimate
the software size to serve as an input for project effort estimation. Subsequently,
when a more accurate extended ER model is available, the model can be reapplied for
more accurate project estimation. A final revision of the project estimate should be
carried out at the end of the requirements analysis stage, when an accurate extended
ER diagram should be available.

The well-known Function Point method is also mainly for data-intensive systems. 
As such, the domain of application for the proposed method for software sizing is 
similar to that of Function Point method. 

Abstract. In Data Warehouse (DW) scenarios, ETL (Extraction, Transformation,
Loading) processes are responsible for the extraction of data
from heterogeneous operational data sources, their transformation (con- 
version, cleaning, normalization, etc.) and their loading into the DW. In 
this paper, we present a framework for the design of the DW back-stage 
(and the respective ETL processes) based on the key observation that 
this task fundamentally involves dealing with the specificities of infor- 
mation at very low levels of granularity including transformation rules 
at the attribute level. Specifically, we present a disciplined framework for 
the modeling of the relationships between sources and targets in differ- 
ent levels of granularity (including coarse mappings at the database and 
table levels to detailed inter-attribute mappings at the attribute level). 
In order to accomplish this goal, we extend UML (Unified Modeling 
Language) to model attributes as first-class citizens. In our attempt to 
provide complementary views of the design artifacts in different levels of 
detail, our framework is based on a principled approach in the usage of 
UML packages, to allow zooming in and out the design of a scenario. 

Keywords: data mapping, ETL, data warehouse, UML 



1 Introduction 

In Data Warehouse (DW) scenarios, ETL (Extraction, Transformation, Loading) 
processes are responsible for the extraction of data from heterogeneous opera- 
tional data sources, their transformation (conversion, cleaning, normalization, 
etc.) and their loading into the DW. DWs are usually populated with data from 
different and heterogeneous operational data sources such as legacy systems, re- 
lational databases, COBOL files, Internet (XML, web logs) and so on. It is well 
recognized that the design and maintenance of these ETL processes (also called 
DW back stage) is a key factor of success in DW projects for several reasons, the 
most prominent of which is their critical mass; in fact, ETL development can 
take up as much as 80% of the development time in a DW project [1, 2].

Despite the importance of designing the mapping of the data sources to 
the DW structures along with any necessary constraints and transformations, 
unfortunately, there are few models that can be used by the designers to this end.
The front end of the DW has monopolized the research on the conceptual part 
of DW modeling, while few attempts have been made towards the conceptual 
modeling of the back stage [3,4]. Still, to this day, there is no model that can 
combine (a) the desired detail of modeling data integration at the attribute 
level and (b) a widely accepted modeling formalism such as the ER model or 
UML. One particular reason for this is that both these formalisms are simply
not designed for this task; on the contrary, they treat attributes as second-class, 
weak entities, with a descriptive role. Of particular importance is the problem 
that in both models attributes cannot serve as an end in an association or any 
other relationship. 

One might argue that the current way of modeling is sufficient and there is no 
real need to extend it in order to capture mappings and transformations at the 
attribute level. There are certain reasons that we can list against this argument: 

— The design artifacts are acting as blueprints for the subsequent stages of the 
DW project. If the important details of this design (e.g., attribute interre- 
lationships) are not documented, the blueprint is problematic. Actually, one 
of the current issues in DW research involves the efficient documentation of 
the overall process. Since design artifacts are means of communicating ideas, 
it is best if the formalism adopted is a widely used one (e.g., UML or ER). 

— The design should reflect the architecture of the system in a way that is 
formal, consistent and allows the what-if analysis of subsequent changes. 
Capturing attributes and their interrelations as first-class modeling elements 
(FCME, also known as first-class citizens) improves the design significantly 
with respect to all these goals. At the same time, the way this issue is handled 
now would involve a naive, informal documentation through UML notes. 

— In previous lines of research [5], we have shown that by modeling attribute 
interrelationships, we can treat the design artifact as a graph and actually 
measure the aforementioned design goals. Again, this would be impossible 
with the current modeling formalisms. 

To address all the aforementioned issues, in this paper, we present an ap- 
proach that enables the tracing of the DW back-stage (ETL processes) particu- 
larities at various levels of detail, through a widely adopted formalism (UML). 
This is enabled by an additional view of a DW, called the data mapping dia- 
gram. In this new diagram, we treat attributes as FCME of the model. This gives 
us the flexibility of defining models at various levels of detail. Naturally, since 
UML is not initially prepared to support this behavior, we solve this problem 
thanks to the extension mechanisms that it provides. Specifically, we employ a 
formal, strict mechanism that maps attributes to proxy classes that represent 
them. Once mapped to classes, attributes can participate in associations that de- 
termine the inter-attribute mappings, along with any necessary transformations 
and constraints. We adopt UML as our modeling language due to its wide accep- 
tance and the possibility of using various complementary diagrams for modeling 
different system aspects. Actually, from our point of view, one of the main ad- 
vantages of the approach presented in this paper is that it is totally integrated 
in a global approach that allows us to accomplish the conceptual, logical and the
corresponding physical design of all DW components by using the same notation
([6-8]).

The rest of the paper is structured as follows. In Section 2, we briefly describe 
the general framework for our DW design approach and introduce a motivating 
example that will be followed throughout the paper. In Section 3, we show how 
attributes can be represented as FCME in UML. In Section 4, we present our 
approach to model data mappings in ETL processes at the attribute level. In 
Section 5, we review related work and finally, in Section 6 we present the main 
conclusions and future work. 



2 Framework and Motivation 

In this section we discuss our general assumptions around the DW environment 
to be modelled and briefly give the main terminology. Moreover, we define a 
motivating example that we will consistently follow through the rest of the paper. 

The architecture of a DW is usually depicted as various layers of data in which 
data from one layer is derived from data of the previous layer [9]. Following this
consideration, we consider that the development of a DW can be structured into 
an integrated framework with five stages and three levels that define different 
diagrams for the DW model, as explained in Table 1. 

Table 1. Data warehouse development framework



— Stages: we distinguish five stages in the definition of a DW: 

• Source: it defines the data sources of the DW, such as OLTP systems, external data sources 
(syndicated data, census data), etc. 

• Integration: it defines the mapping between the data sources and the DW. 

• Data Warehouse: it defines the structure of the DW. 

• Customization: it defines the mapping between the DW and the clients’ structures. 

• Client: it defines special structures that are used by the clients to access the DW, such as 
data marts or OLAP applications. 

— Levels: each stage can be analyzed at three levels or perspectives: 

• Conceptual: it defines the DW from a conceptual point of view. 

• Logical: it addresses logical aspects of the DW design, such as the definition of the ETL 
processes. 

• Physical: it defines physical aspects of the DW, such as the storage of the logical structures 
in different disks, or the configuration of the database servers that support the DW. 

— Diagrams: each combination of stage and level requires its own modeling formalism. Therefore, 
our approach is composed of 15 diagrams (five stages by three levels), but the DW designer does 
not need to define all the diagrams in each DW project. In our approach, we use UML [10] as the 
modeling language, because its expressive power is sufficient for the modeling of all the diagrams 
of the framework. But as UML is a general modeling language, we need to use the UML extension 
mechanisms to adapt UML to specific domains. 
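
As a rough illustration of how the 15 diagrams of Table 1 arise, the following Python sketch enumerates every pairing of a stage with a level. The stage and level names are taken from Table 1; the function name `framework_diagrams` is our own and only serves the example.

```python
from itertools import product

# Stage and level names as listed in Table 1.
STAGES = ["Source", "Integration", "Data Warehouse", "Customization", "Client"]
LEVELS = ["Conceptual", "Logical", "Physical"]


def framework_diagrams():
    """Return one (stage, level) pair per diagram of the framework."""
    return list(product(STAGES, LEVELS))


if __name__ == "__main__":
    for stage, level in framework_diagrams():
        print(f"{stage} / {level}")
```

The designer may of course define only a subset of these pairs in a given project, as the table notes.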



In previous works, we have presented some of the diagrams (and the cor- 
responding profiles), such as the Multidimensional Profile [6, 7] and the ETL 
Profile [4]. In this paper, we introduce the Data Mapping Profile. 

To motivate our discussion we will introduce a running example where the 
designer wants to build a DW from the retail system of a company. Naturally, we 
consider only a small part of the DW, where the target fact table has to contain 
only the quarterly sales of the products belonging to the computer category, 
whereas the rest of the products are discarded. 

In Fig. 1, we zoom in on the definition of the SCS (Source Conceptual Schema), 
which represents the sources that feed the DW with data. In this example, 
the data source is composed of four entities represented as UML classes: Cities, 
Customers, Orders, and Products. The meaning of the classes and their attributes, 
as depicted in Fig. 1, is straightforward. The “...” shown in this figure simply 
indicates that other attributes of these classes exist, but they are not displayed 
for the sake of simplicity (this use of “...” is not a UML notation). 



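[Figure 1: UML class diagram with four classes. Cities (city_id, city_name, state, ...); 
Customers (cust_id, cust_name, cust_surname, city_id, ...); Orders (order_id, cust_id, 
prod_list, ...); Products (prod_id, prod_name, category, price, ...). Multiplicities shown 
on the associations between adjacent classes: 1 to 0..*, 1 to 0..*, and 0..* to 1..*.] 

Fig. 1. Source Conceptual Schema (SCS) 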



[Figure 2: UML class diagram with three classes. Products (prod_id, prod_name); 
ComputerSales (prod_id, quarter_id, sales); Time (quarter_id, quarter_start, quarter_end). 
Associations: Products 1 to 0..* ComputerSales, and ComputerSales 0..* to 1 Time.] 

Fig. 2. Data Warehouse Conceptual Schema (DWCS) 

Finally, the DWCS (Data Warehouse Conceptual Schema) of our motivating 
example is shown in Fig. 2. The DW is composed of one fact (ComputerSales) 
and two dimensions (Products and Time). 

In this paper, we present an additional view of a DW, called the Data Mapping, 
which shows the relationships between the data sources and the DW and between 
the DW and the clients’ structures. In this new diagram, we need to treat 
attributes as FCME of the models, since we need to depict their relationships 
at the attribute level. Therefore, we also propose a UML extension to accomplish 
this goal. To the best of our knowledge, this is the first proposal for 
representing attributes as FCME in UML diagrams. 

3 Attributes as First-Class Modeling Elements in UML 

Both in the Entity-Relationship (ER) model and in UML, attributes are embedded 
in the definition of their comprising “element” (an entity in the ER model or 
a class in UML), and it is not possible to create a relationship between two 
attributes. As we have already explained in the introduction, in some situations 
(e.g., data integration, constraints over attributes, etc.) it is desirable to 
represent attributes as FCME. Therefore, in this section we present an extension 
of UML to accommodate attributes as FCME. We have chosen UML instead of 
ER on the grounds of its higher flexibility in terms of employing complementary 
diagrams for the design of a certain system. 

Throughout this paper, we frequently use the term first-class modeling elements 
or first-class citizens for elements of our modeling languages. Conceptually, 
FCME refer to fundamental modeling concepts, on the basis of which 
our models are built. Technically, FCME have an identity of their own, and 
they are possibly governed by integrity constraints (e.g., relationships must have 
at least two ends referring to classes). In a UML class diagram, two kinds of 
modeling elements are treated as FCME. Classes, as abstract representations of 
real-world entities, are naturally found at the center of the modeling effort. Being 
FCME, classes are stand-alone entities that also act as attribute containers. The 
relationships between classes are captured by associations. Associations can 
also be FCME, in which case they are called association classes. Even though an 
association class is drawn as an association and a class, it is really just a single 
model element [10]. An association class can contain attributes or can be connected 
to other classes. However, the same is not possible with attributes. 

Naturally, in order to allow attributes to play the same role in certain cases, 
we propose the representation of attributes as FCME in UML. In our approach, 
classes and attributes are defined as usual in UML. However, in those cases 
where it is necessary to treat attributes as FCME, classes are imported into the 
attribute/class diagram, where attributes are automatically represented as classes; 
in this way, the user only has to define the classes and the attributes once. In 
the importing process from the class diagram to the attribute/class diagram, we 
refer to the class that contains the attributes as the container class and to the 
class that represents an attribute as the attribute class. In Table 2, we formally 
define attribute/class diagrams, along with the new stereotypes, «Attribute» 
and «Contain». 
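
The import step and the naming convention of Table 2 can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the types `UmlClass` and `AttributeClass` and the function `import_class` are hypothetical names, and the tag definitions of Definition 1 are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UmlClass:
    """A container class as defined in an ordinary class diagram."""
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class AttributeClass:
    """An «Attribute» class: it has no attributes or methods of its own."""
    name: str       # "<container>.<attribute>", per the naming convention
    container: str  # container class it is tied to by a «Contain» relationship


def import_class(cls: UmlClass) -> List[AttributeClass]:
    """Derive one attribute class per attribute of the container class."""
    return [AttributeClass(f"{cls.name}.{attr}", cls.name) for attr in cls.attributes]


if __name__ == "__main__":
    orders = UmlClass("Orders", ["order_id", "cust_id", "prod_list"])
    for attribute_class in import_class(orders):
        print(attribute_class.name)
```

Because the attribute classes are derived rather than written by hand, the classes and attributes are defined only once, as the text requires.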

Table 2. Definitions 

Definition 1 Attribute classes are materializations of the «Attribute» stereotype, introduced 
specifically for representing the attributes of a class. The following constraints apply for the 
correct definition of an attribute class as a materialization of an «Attribute» stereotype: 

— Naming convention: the name of the attribute class is the name of the corresponding 
container class, followed by a dot and the name of the attribute. 

— No features: an attribute class can contain neither attributes nor methods. 

— Tag definitions: an attribute class contains the following tag definitions that represent 
the properties of an attribute model element (according to the UML Specification [10]): 
changeability, initialValue, multiplicity, ordering, ownerScope, property-string, stereotype, type, 
and visibility. 

Definition 2 A contain relationship is a composite aggregation between a container class and its 
corresponding attribute classes, originated at the end near the container class and highlighted 
with the «Contain» stereotype. 

Definition 3 An attribute/class diagram is a regular UML class diagram extended with 
«Attribute» classes and «Contain» relationships. 

4 The Data Mapping Diagram 

Once we have introduced the extension mechanism that enables UML to treat 
attributes as FCME, we can proceed to define a framework for its usage. In 
this section, we will introduce the data mapping diagram, which is a new kind 
of diagram, particularly customized for the tracing of the data flow, at various 
degrees of detail, in a DW environment. Data mapping diagrams are complementary 
to the typical class and interaction diagrams of UML and focus on the 
particularities of the data flow and the interconnections of the involved data 
stores. A special characteristic of data mapping diagrams is that a certain DW 
scenario is practically described by a set of complementary data mapping diagrams, 
each defined at a different level of detail. In this section, we will introduce 
a principled approach to deal with such complementary data mapping diagrams. 

To capture the interconnections between design elements, in terms of data, 
we employ the notion of mapping. Broadly speaking, when two design elements 
(e.g., two tables, or two attributes) share the same piece of information, possibly 
through some kind of filtering or transformation, this constitutes a semantic 
relationship between them. In the DW context, this relationship involves three 
logical parties: (a) the provider entity (schema, table, or attribute), responsible 
for generating the data to be further propagated, (b) the consumer, which receives 
the data from the provider, and (c) their intermediate matching, which describes the 
way the mapping is done, along with any transformation and filtering. 
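
To make the three parties concrete, here is a minimal Python sketch of a mapping. It is an assumption-laden illustration, not part of the proposed profile: the class `Mapping`, its field names, and the helper `propagate` are hypothetical, and the matching is reduced to one filter predicate plus one transformation function.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


def _identity(value: Any) -> Any:
    """Default transformation: a direct mapping that changes nothing."""
    return value


def _keep_all(value: Any) -> bool:
    """Default filter: every provider value reaches the consumer."""
    return True


@dataclass
class Mapping:
    provider: str                                # party (a): where the data originate
    consumer: str                                # party (b): where the data arrive
    transform: Callable[[Any], Any] = _identity  # party (c): transformation part
    keep: Callable[[Any], bool] = _keep_all      # party (c): filtering part


def propagate(mapping: Mapping, values: Iterable[Any]) -> List[Any]:
    """Apply the matching to provider values, yielding consumer values."""
    return [mapping.transform(v) for v in values if mapping.keep(v)]
```

With the defaults, a `Mapping` describes a direct mapping; supplying `transform` or `keep` corresponds to the "transformation and filtering" mentioned in the text.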

Since a data mapping diagram can be very complex, our approach offers the 
possibility to organize it in different levels thanks to the use of UML packages. 
Our layered proposal consists of four levels (see Fig. 3), as explained in 
Table 3. 



Table 3. Data mapping levels 



Database Level (or Level 0). In this level, each schema of the DW environment (e.g., data sources 
at the conceptual level in the SCS, conceptual schema of the DW in the DWCS, etc.) is rep- 
resented as a package [8]. The mappings among the different schemata are modeled in a single 
mapping package, encapsulating all the lower-level mappings among different schemata. 

Dataflow Level (or Level 1). This level describes the data relationship among the individual 
source tables of the involved schemata towards the respective targets in the DW. Practically, a 
data mapping diagram at the database level is zoomed-in to a set of more detailed data mapping 
diagrams, each capturing how a target table is related to source tables in terms of data. 

Table Level (or Level 2). Whereas the mapping diagram of the dataflow level describes the data 
relationships among sources and targets using a single package, the data mapping diagram at 
the table level details all the intermediate transformations and checks that take place during 
this flow. Practically, if a data mapping is simple, a single package that represents the data 
mapping can be used at this level; otherwise, a set of packages is used to segment complex data 
mappings into sequential steps. 

Attribute Level (or Level 3). In this level, the data mapping diagram involves the capturing of 
inter-attribute mappings. Practically, this means that the diagram of the table is zoomed-in 
and the mapping of provider to consumer attributes is traced, along with any intermediate 
transformation and cleaning. As we shall describe later, we provide two variants for this level. 



At the leftmost part of Fig. 3, a simple relationship between the DWCS and 
the SCS exists: this is captured by a single Data Mapping package, and these three 
design elements constitute the data mapping diagram of the database level (or 
Level 0). Assuming that there are three particular tables in the DW that we 
would like to populate, this particular Data Mapping package abstracts the fact 
that there are three main scenarios for the population of the DW, one for each 
of these tables. In the dataflow level (or Level 1) of our framework, the data 
relationships among the sources and the targets in the context of each of the 
scenarios are modeled by the respective package. If we zoom in on one of 
these scenarios, e.g., Mapping 1, we can observe its particularities in terms of data 
transformation and cleaning: the data of Source 1 are transformed in two steps 
(i.e., they undergo two different transformations), as shown in Fig. 3. 
Observe also that there is an Intermediate data store employed to hold the output 
of the first transformation (Step 1) before it is passed on to the second one (Step 
2). Finally, at the lower right part of Fig. 3, the way the attributes are mapped 
to each other for the data stores Source 1 and Intermediate is depicted. Let us 
point out that when we are modeling a large and complex DW, the attribute 
transformation modeled at Level 3 is hidden within a package definition, thereby 
avoiding the use of cluttered diagrams. 




[Figure 3: diagram not recoverable; surviving panel labels: Level 0, Level 1, Level 3] 

Fig. 3. Data mapping levels 

The constructs that we employ for the data mapping diagrams at different 
levels are as follows: 

— The database and dataflow diagrams (Levels 0 and 1) use traditional UML 
structures for their purpose. Specifically, in these diagrams we employ (a) 
packages for the modeling of data relationships and (b) simple dependen- 
cies among the involved entities. The dependencies state that the mapping 
packages are dependent upon the changes of the employed data stores. 

— The table level (Level 2) diagram extends UML with three stereotypes: (a) 
«Mapping», used as a package that encapsulates the data interrelationships 
among data stores, and (b) «Input» and «Output», which explain the roles of 
providers and consumers for the «Mapping». 

— The diagram at the attribute level (Level 3) also uses several newly introduced 
stereotypes, namely «Map», «MapObj», «Domain», «Range», 
«Input», «Output», and «Intermediate», for the definition of data mappings. 

We will detail the stereotypes of the table level in the next section and defer 
the discussion of the stereotypes of the attribute level to subsection 4.2. 



4.1 The Data Mapping Diagram at the Table Level 

During the integration process from data sources into the DW, source data 
may undergo a series of transformations, which may vary from simple algebraic 
operations or aggregations to complex procedures. In our approach, the 
designer can segment a long and complex transformation process into simple 
and small parts represented by means of UML packages that are materializations 
of a «Mapping» stereotype and contain an attribute/class diagram. Moreover, 
«Mapping» packages are linked by «Input» and «Output» dependencies 
that represent the flow of data. During this process, the designer can create 
intermediate classes, represented by the «Intermediate» stereotype, in order to 
simplify or clarify the models. These classes represent intermediate storage that 
may or may not actually exist, but they help to understand the mappings. 

In Fig. 4, a schematic representation of a data mapping diagram at the table 
level is shown. This level specifies the data sources and the targets to which 
these data are directed. At this level, the classes are represented as usual in 
UML, with the attributes depicted inside the container class. Since all the classes 
are imported from other packages, the legend (from ...) appears below the name 
of each class. The mapping diagram is shown as a package decorated with the 
«Mapping» stereotype and hides the complexity of the mapping, because a vast 
number of attributes can be involved in a data mapping. This package presents 
two kinds of stereotyped dependencies: «Input» to the data providers (i.e., the 
data sources) and «Output» to the data consumers (i.e., the tables of the DW). 



4.2 The Data Mapping Diagram at the Attribute Level 

As already mentioned, at the attribute level, the diagram includes the relationships 
between the attributes of the classes involved in a data mapping. At this 
level, we offer two design variants: 

— Compact variant: the relationship between the attributes is represented as 
an association, and the semantics of the mapping is described in a UML note 
attached to the target attribute of the mapping. 

— Formal variant: the relationship between the attributes is represented by 
means of a mapping object, and the semantics of the mapping is described in 
a tag definition of the mapping object. 

With the first variant, the data mapping diagrams are less cluttered, with 
fewer modeling elements, but the data mapping semantics are expressed as UML 
notes, which are simple comments that have no semantic impact. On the other 
hand, the data mapping diagrams obtained with the second variant 
are larger, with more modeling elements and relationships, but the semantics are 
better defined as tag definitions. Due to the lack of space, in this paper we 
will only focus on the compact variant. In this variant, the relationship between 
the attributes is represented as an association decorated with the stereotype 
«Map», and the semantics of the mapping is described in a UML note attached 
to the target attribute of the mapping. 



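[Figure 4: data source classes DS1 and DS2 (from SCS), each with attributes Att1, Att2, 
and Att3, are linked by «Input» dependencies to a «Mapping» package named Mapping 
diagram, which is linked by an «Output» dependency to the class Dim1 (from DWCS), with 
attributes Att1, Att2, and Att3.] 

Fig. 4. Level 2 of a data mapping diagram 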



The content of the package Mapping diagram from Fig. 4 is defined in the following 
way (recall that Mapping diagram is a «Mapping» package that contains 
an attribute/class diagram): 

— The classes DS1, DS2, ..., and Dim1 are imported in Mapping diagram. 

— The attributes of these classes are suppressed because they are shown as 
«Attribute» classes in this package. 

— The «Attribute» classes are connected by means of association relationships 
and we use the navigability property to specify the flow of data from the data 
sources to the DW. 

— The association relationships are adorned with the stereotype «Map» to 
highlight the meaning of this relationship. 

— A UML note can be attached to each one of the target attributes to specify 
how the target attribute is obtained from the source attributes. The language 
for the expression is a choice of the designer (e.g., a LAV vs. a GAV approach 
[11] can be equally followed). 
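
The contents of such a package in the compact variant can be approximated by a small data structure. The Python sketch below is ours, not the paper's: `Map`, `MappingPackage` and their methods are hypothetical names, the attribute names come from the running example, and the note text is an invented placeholder, since the expression language is left to the designer.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Map:
    """A «Map» association; navigability runs from source to target."""
    source: str  # «Attribute» class on the provider side
    target: str  # «Attribute» class on the consumer side


@dataclass
class MappingPackage:
    """A «Mapping» package holding «Map» associations and UML notes."""
    name: str
    maps: List[Map] = field(default_factory=list)
    notes: Dict[str, str] = field(default_factory=dict)  # target attribute -> note

    def connect(self, source: str, target: str) -> None:
        self.maps.append(Map(source, target))

    def annotate(self, target: str, text: str) -> None:
        self.notes[target] = text

    def providers_of(self, target: str) -> List[str]:
        """Source attributes feeding the given target attribute."""
        return sorted(m.source for m in self.maps if m.target == target)


def build_aggregating() -> MappingPackage:
    pkg = MappingPackage("Aggregating")
    pkg.connect("FilteredOrders.quantity", "ComputerSales.sales")
    pkg.connect("Products.price", "ComputerSales.sales")
    # Placeholder note: the wording is an assumption, not taken from the paper.
    pkg.annotate("ComputerSales.sales", "sales = SUM(quantity * price)")
    return pkg
```

Keeping the note as free text mirrors the trade-off stated above: the structure stays small, but the note itself carries no machine-checkable semantics.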



4.3 Motivating Example Revisited 

From the DW example shown in Figs. 1 and 2, we define the corresponding data 
mapping diagram shown in Fig. 5. The goal of this data mapping is to calculate 
the quarterly sales of the products belonging to the computer category. The 
result of this transformation is stored in ComputerSales from the DWCS. The 
transformation process has been segmented in three parts: Dividing, Filtering, and 
Aggregating; moreover, DividedOrders and FilteredOrders, two «Intermediate» 
classes, have been defined. 

Fig. 5. Level 2 of a data mapping diagram 



Continuing with the data mapping example shown in Fig. 5, the attribute prod_list 
of the Orders table contains the list of ordered products with the product ID and 
(parenthesized) quantity for each. Therefore, Dividing splits each input order 
according to its prod_list into multiple orders, each with a single ordered product 
(prod_id) and quantity (quantity), as shown in Fig. 6. Note that in a data 
mapping diagram the designer does not specify the processes, but only the data 
relationships. We use the one-to-many cardinality in the association relationships 
between Orders.prod_list and DividedOrders.prod_id and DividedOrders.quantity 
to indicate that one input order produces multiple output orders. We do not 
attach any note in this diagram because the data are not transformed, so the 
mapping is direct. 

Fig. 6. Dividing Mapping 
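
Although the diagram deliberately records only data relationships, one process consistent with the Dividing mapping can be sketched in Python. The textual layout of prod_list is not given in the paper; the sketch assumes a comma-separated form such as "P1(2), P7(1)", and the function name `divide` and the dictionary keys are likewise our own choices.

```python
import re

# Assumed item syntax: a product ID followed by a parenthesized quantity.
_ITEM = re.compile(r"\s*([^,()\s]+)\s*\((\d+)\)\s*")


def divide(order):
    """Split one order into one row per ordered product (one-to-many)."""
    rows = []
    for item in order["prod_list"].split(","):
        match = _ITEM.fullmatch(item)
        if match is None:
            raise ValueError(f"unrecognized prod_list item: {item!r}")
        rows.append({
            "order_id": order["order_id"],
            "prod_id": match.group(1),
            "quantity": int(match.group(2)),
        })
    return rows


if __name__ == "__main__":
    for row in divide({"order_id": 1, "prod_list": "P1(2), P7(1)"}):
        print(row)
```

Each output row corresponds to one DividedOrders instance, which is what the one-to-many cardinality on the association expresses.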



Filtering (Fig. 7) filters out products not belonging to the computer category. 
We indicate this action with a UML note attached to the prod_id mapping, 
because this attribute is expected to be used in the filtering 
process. 

Finally, Aggregating (Fig. 8) computes the quarterly sales for each product. 
We use the many-to-one cardinality to indicate that many input items 
are needed to calculate a single output item. Moreover, a UML note indicates 
how the ComputerSales.sales attribute is calculated from FilteredOrders.quantity 
and Products.price. The cardinality of the association relationship between 
Products.price and ComputerSales.sales is one-to-many because the same price is used 
in different quarters, but to calculate the total sales of a particular product in 
a quarter we only need one price (we assume that the price of a product never 
changes over time). 

Fig. 8. Aggregating Mapping 
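
For illustration only, the Filtering and Aggregating steps can be played through on toy data in Python. Several details here are assumptions rather than statements of the paper: each row is taken to carry a quarter_id already, the category value is spelled "computer", and sales is computed as the sum of quantity times price per product and quarter.

```python
from collections import defaultdict


def filter_computers(rows, products, category="computer"):
    """Keep only rows whose product belongs to the given category."""
    return [r for r in rows if products[r["prod_id"]]["category"] == category]


def aggregate(rows, products):
    """Total sales per (prod_id, quarter_id): many input rows, one output item."""
    totals = defaultdict(float)
    for r in rows:
        price = products[r["prod_id"]]["price"]  # one price per product
        totals[(r["prod_id"], r["quarter_id"])] += r["quantity"] * price
    return dict(totals)


if __name__ == "__main__":
    products = {
        "P1": {"category": "computer", "price": 100.0},
        "P2": {"category": "garden", "price": 5.0},
    }
    divided = [
        {"prod_id": "P1", "quarter_id": "Q1", "quantity": 2},
        {"prod_id": "P2", "quarter_id": "Q1", "quantity": 4},
        {"prod_id": "P1", "quarter_id": "Q2", "quantity": 3},
    ]
    print(aggregate(filter_computers(divided, products), products))
```

The single price lookup per product reflects the assumption, stated in the text, that prices do not change over time.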

5 Related Work 

There is a relatively small body of research on the conceptual 
modeling of the DW back-stage. 

In [12,13], model management, a framework for supporting meta-data 
related applications in which models and mappings are manipulated, is proposed. 
In [13], two scenarios related to loading DWs are presented as case studies: on the 
one hand, the mapping between the data sources and the DW; on the other hand, 
the mapping between the DW and a data mart. In this approach, a mapping is 
a model that relates the objects (attributes) of two other models; each object 
in a mapping is called a mapping object and has three properties: domain and 
range, which point to objects in the source and the target respectively, and expr, 
which is an expression that defines the semantics of that mapping object. This is 
an isolated approach in which the authors propose their own graphical notation for 
representing data mappings. Therefore, from our point of view, there is a lack 
of integration with the design of other parts of a DW. 

In [3] the authors attempt to provide a first model towards the conceptual 
modeling of the DW back-stage. The notion of provider mapping among at- 
tributes is introduced. In order to avoid the problems caused by the specific 
nature of ER and UML, the authors adopt a generic approach. The static conceptual 
model of [3] is complemented in [5] with the logical design of ETL processes 
as data-centric workflows. ETL processes are modeled as graphs composed 
of activities that include attributes as FCME. Moreover, different kinds of rela- 
tionships capture the data flow between the sources and the targets. 

Regarding data mapping, in [14] the authors discuss issues related to data 
mapping in the integration of data. A set of mapping operators is introduced 
and a classification of possible mapping cases is presented. However, no graphical 
representation of data mapping scenarios is provided, which makes the approach 
difficult to use in real-world projects. 

The issue of treating attributes as FCME has generated several debates since 
the beginning of the conceptual modeling field [15]. More recently, some object-oriented 
modeling approaches such as OSM (Object Oriented System Model) 
[16] or ORM (Object Role Modeling) [17] reject the use of attributes (attribute-free 
models), mainly because of their inherent instability. In these approaches, 
attributes are represented with entities (objects) and relationships. Although 
an ORM diagram can be transformed into a UML diagram, our data mapping 
diagram is coherently integrated in a global approach for the modeling of DWs 
[6, 7], and particularly, of ETL processes [4]. In this approach, we have used the 
extension mechanisms provided by UML to adapt it to our particular needs for 
the modeling of DWs. In this case, we always use formal extensions of the UML 
for modeling all parts of DWs. 

6 Conclusions and Future Work 

In this paper, we have presented a framework for the design of the DW back-stage 
(and the respective ETL processes) based on the key observation that 
this task fundamentally involves dealing with the specificities of information 
at very low levels of granularity. Specifically, we have presented a disciplined 
framework for the modeling of the relationships between sources and targets at 
different levels of granularity (i.e., from coarse mappings at the database level to 
detailed inter-attribute mappings at the attribute level). Unfortunately, standard 
modeling languages like the ER model or UML are fundamentally handicapped 
in treating low-granularity entities (i.e., attributes) as FCME. Therefore, in order to 
formally accomplish the aforementioned goal, we have extended UML to model 
attributes as FCME. In our attempt to provide complementary views of the 
design artifacts at different levels of detail, we have based our framework on a 
principled approach to the usage of UML packages, to allow zooming in and out 
of the design of a scenario. 

Although we have developed the representation of attributes as FCME in 
UML in the context of DW, we believe that our solution can be applied in 
other application domains as well, e.g., definition of indexes and materialized 
views in databases, modeling of XML documents, specification of web services, 
etc. Currently, we are extending our proposal in order to represent attribute 
constraints such as uniqueness or disjunctive values. 
Abstract. We propose a requirements elicitation process for a data 
warehouse (DW) that identifies its information contents. These contents 
support the set of decisions that can be made. Thus, if the information 
needed to take every decision is elicited, then the total information 
determines DW contents. We propose an Informational Scenario 
as the means to elicit information for a decision. An informational scenario 
is written for each decision and is a sequence of pairs of the form 
<Query, Response>. A query requests the information necessary to 
take a decision and the response is the information itself. The set of 
responses for all decisions identifies DW contents. We show that informational 
scenarios are merely another subclass of the class of scenarios. 



1 Introduction 

In the last decade, great interest has been shown in the development of data ware- 
houses (DWs). We can look at data warehouse development at the design, the 
conceptual, and the requirements engineering levels. Two different approaches 
for the development of DWs have been proposed at the design level. These are 
the data-driven [9], and the requirements-driven [2,12,8,19] approaches. Given 
data needs, these approaches identify the logical structure of the DW. 

Jarke et al. [11] propose to add a conceptual layer on top of the logical layer. 
Whereas they propose the basic notion of the conceptual layer, they assume that 
the conceptual objects represented in the Enterprise Model can be determined; 
the question of which conceptual objects are useful for a DW and how these 
are to be determined is not addressed. Thus, the conceptual level does not take 
into account the larger context in which the DW is to function. 

A relationship of the Data Warehouse to the organizational context is established 
at the requirements level. Fabio Rilson and Jaelson Freire [7] adapt 
traditional requirements engineering techniques to Data Warehouses. This approach 
starts with a Requirements Management Planning phase, for which the 
authors propose guidelines concerning acquisition, documentation and control 
of selected requirements. The second phase covers a) Requirements Specification, 
which includes requirements elicitation through interviews, workshops, 
prototyping and scenarios, b) Requirements Analysis and c) Requirements Documentation. 
In the third and fourth phases, requirements are conformed and validated 
respectively. 

The proposal of [7] is a “top” level proposal that builds a framework for DW 
requirements engineering. While providing pointers to RE approaches that may 
be applicable, this proposal does not establish their feasibility and also does not 
consider any detailed technical solutions. 

Boehnlein et al. [3] present a goal-driven approach that is based on the 
SOM (Semantic Object Model) process modeling technique. It starts with the 
determination of two kinds of goals: one specifies the products and services to 
be provided, whereas the other determines the extent to which the goal is to be 
pursued. Information requirements are derived by analyzing business processes 
in increasing detail and by transforming relevant data structures of business 
processes into data structures of the data warehouse. According to [19], since 
data warehouse systems are developed exclusively to support decision processes, 
a detailed business process analysis is not feasible for decision processes because 
the respective tasks are unique and often not structured. Moreover, sometimes 
knowledge workers refuse to disclose their process in detail. 

The proposal of [14] aims to identify the decisional information to be kept 
in the Data Warehouse. This process starts with the determination of the goals of 
an organization, uses these to arrive at its decision-making needs, and finally 
identifies the information needed for the decisions to be supported. Therefore, the 
requirements engineering product is a Goal-Decision-Information (GDI) diagram 
that uses two associations: 1) goal-decision coupling and 2) decision-information 
coupling. Whereas this proposal relates DW information contents 
to its decision-making capability as embedded in organizational goals, it is not 
backed up by a requirements elicitation process. 

In this paper, we look at the requirements elicitation process for arriving at the 
GDI diagram. The total process is a two-part one. In the first part, the goal-decision 
coupling is elicited. That is, the set of decisions that can fulfill the goals 
of an organization is elicited. Thereafter, in the second part, from the elicited 
decisions, the decision-information coupling yields the decisional information. 
Here, we deal with the second part of this process. 

We base our proposal on the notion of scenarios [13, 16, 6, 10, 18]. A scenario 
has been considered as a typical interaction between a user and the system to 
be developed. We treat this as the generic notion of a scenario. This is shown in 
Fig. 1 as the root node of the scenario typology hierarchy. We refer to traditional 
scenarios as transactional scenarios since they reveal the system functionality 
needed in the new system. We propose a second kind of scenario, called Data 
Warehouse scenarios. In consonance with our two-part process, for goal-decision 
coupling we propose decisional scenarios and for decision-information coupling 
we postulate informational scenarios (see Fig. 1). As mentioned earlier, our 
interest in this paper is in informational scenarios. 

Informational scenarios reveal the information contents of a system. An informational 
scenario represents a typical interaction between a decision-maker 
and the decisional system. This interaction is a sequence of pairs <Q, R>, 
where Q represents the query input to the system by the decision-maker and 
R represents the response obtained. This response yields the information to be 
kept in the decisional system, the data warehouse. 

Fig. 1. Scenario Typology. 
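
A minimal Python sketch may help fix the idea of an informational scenario as a sequence of <Q, R> pairs whose responses, taken over all decisions, identify the DW contents. The representation is our own assumption: each response is modeled as a set of information item names, and the decision, queries and items in the usage example are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

# One informational scenario: an ordered list of (query, response) pairs,
# where a response is modeled as the set of information items it returns.
Scenario = List[Tuple[str, Set[str]]]


def dw_contents(scenarios: Dict[str, Scenario]) -> Set[str]:
    """Union of all responses over the scenarios of all decisions."""
    contents: Set[str] = set()
    for pairs in scenarios.values():
        for _query, response in pairs:
            contents |= response
    return contents


if __name__ == "__main__":
    # Hypothetical decision with a two-step scenario.
    example = {
        "add a new product line": [
            ("sales of existing lines?", {"product_line", "sales"}),
            ("sales by region?", {"region", "sales"}),
        ],
    }
    print(sorted(dw_contents(example)))
```

The order of the pairs matters for the interaction itself, but not for the resulting contents, which is why a plain set union suffices here.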

In the next section we present the GDI model. In section 3, we define and 
illustrate informational scenarios. In subsection 3.1, we position them in the 
4-dimensional classification system of scenarios. In subsection 3.2, we show elic- 
itation of decisions from an informational scenario. The paper ends with a con- 
clusion. 



2 The GDI 

The Goal-Decision-Information (GDI) model is shown in Fig. 2. In accordance
with goal-orientation [1, 4], we view a goal as an aim or objective that is to be
met. A goal is a passive concept and, unlike an activity/process/event, it cannot
perform or cause any action to be performed. A goal is set, and once so defined
it needs an active component to realize it. The active component is the decision.
Further, to fulfil the decisions, appropriate information is required.




Fig. 2. GDI Diagram. 
As shown in Fig. 2, a goal can be either simple or complex. A simple goal
cannot be decomposed into simpler ones. A complex goal is built out of other 
goals which may themselves be simple or complex. This makes a goal hierarchy. 
The component goals of a complex one may be mandatory or optional. 

A decision is a specification of an active component that causes goal
fulfillment. It is not the active component itself: when a decision is selected for
implementation, one or more actions may be performed to give effect to
it. In other words, a decision is the intention to perform the actions
that cause its implementation. Decision-making is an activity that results
in the selection of the decision to be implemented. It is while performing this
activity that information to select the right decision is needed. As shown in
Fig. 2, a decision can be either simple or complex. A simple decision cannot be
decomposed into simpler ones, whereas a complex decision is built out of other
simple or complex decisions. Fig. 2 shows that there is an association ‘is satisfied
by’ between goals and decisions. This association identifies the decisions which,
when taken, can lead to goal satisfaction.

Knowledge necessary to take decisions is captured in the notion of decisional
information shown in Fig. 2. This information is a specification of the data that
will eventually be stored in the Data Warehouse. Fig. 2 shows that there is an
association ‘is required for’ between decisions and decisional information. This
association identifies the decisional information required to take a decision.

An instance of the GDI diagram, the GDI schema, is shown in Fig. 4. It shows
a goal hierarchy (solid lines between ‘Maximize profit’ and ‘increase the no.
of customers’ and ‘increase sales’) and a decision hierarchy (solid lines between
‘improve the quality of the product’ and ‘introduce statistical quality control
techniques’ and ‘use better quality material’) for a given set of goals and decisions.
The figure shows the ‘is satisfied by’ relationship between the goal ‘increase
sales’ and the decisions ‘open new sales counter’ and ‘improve the quality of the
product’ by dashed lines. The ‘is required for’ relationship between decisions
and associated information is shown by dotted lines.

The dynamics of the interaction between goals, decisions and information is 
shown in Fig. 3. A goal suggests a set of decisions that lead to its satisfaction. A 
decision can be taken after consulting the information relevant to it and available 
in the decisional system. In the reverse direction, information helps in selecting 
a decision, which in turn satisfies a goal. 

For example, the goal ‘increase sales’ suggests the decisions ‘improve the
quality of the product’ and ‘open new sales counter’. These decisions may modify
the goal state. To implement the decision ‘open new sales counter’, the
decision-maker consults the information ‘Existing product demand’ and
‘Existing service/customer centers’.

Fig. 3. The Interaction Cycle (arrows labelled Suggests, Consults, Satisfies and Helps).

Fig. 4. A GDI Schema.
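The structure described in this section can be sketched as a small data model. The following Python fragment is only an illustration under our own assumptions: the class and field names (Goal, Decision, Information, components, satisfied_by, requires) are hypothetical and are not part of the GDI proposal; only the goal, decision and information names are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Information:
    name: str


@dataclass
class Decision:
    name: str
    # Empty list means a simple decision; otherwise a complex one.
    components: List["Decision"] = field(default_factory=list)
    # Decisional information linked by the 'is required for' association.
    requires: List[Information] = field(default_factory=list)

    def is_simple(self) -> bool:
        return not self.components


@dataclass
class Goal:
    name: str
    # Empty list means a simple goal; otherwise a complex one.
    components: List["Goal"] = field(default_factory=list)
    # Decisions linked by the 'is satisfied by' association.
    satisfied_by: List[Decision] = field(default_factory=list)

    def is_simple(self) -> bool:
        return not self.components


def simple_decisions(decision: Decision) -> Iterator[Decision]:
    """Yield the simple decisions reachable from a (possibly complex) decision."""
    if decision.is_simple():
        yield decision
    else:
        for part in decision.components:
            yield from simple_decisions(part)


# The fragment of the GDI schema that is named in the text.
open_counter = Decision(
    "open new sales counter",
    requires=[Information("Existing product demand"),
              Information("Existing service/customer centers")],
)
improve_quality = Decision(
    "improve the quality of the product",
    components=[Decision("introduce statistical quality control techniques"),
                Decision("use better quality material")],
)
increase_sales = Goal("increase sales", satisfied_by=[open_counter, improve_quality])
maximize_profit = Goal(
    "Maximize profit",
    components=[Goal("increase the no. of customers"), increase_sales],
)

# Simple decisions that can lead to satisfaction of 'increase sales'.
names = [d.name for dec in increase_sales.satisfied_by for d in simple_decisions(dec)]
print(names)
```

Listing the simple decisions is useful because, as the next section states, an informational scenario is written for each simple decision.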

3 Informational Scenario 

In this section, we show the elicitation of decisional information. The
decision-information coupling suggests that the information needed to select a
decision can be obtained from the knowledge of the decision itself. Thus, if we
focus attention on a decision then, through a suitable elicitation mechanism, we
can obtain the information relevant to it. Our informational scenario is one such
elicitation mechanism. It can be seen that the informational scenario is an
expression of the ‘is required for’ relationship between a simple decision and
information of the GDI diagram (see Fig. 2).

An informational scenario is a typical interaction between the decision-maker
and the decisional system. An informational scenario is written for each simple
decision of the GDI diagram, and is a sequence of pairs <Q, R>, where Q
represents the query input to the decisional system and R represents the response
of the decisional system. An informational scenario is thus of the form

    <Q_1, R_1>
    <Q_2, R_2>
    ...
    <Q_n, R_n>

The set of queries, Q_1 through Q_n, is an expression of the information relevant
to the decision of the scenario. The information contents of the data warehouse
can be derived from the set of responses R_1 through R_n. We represent a query Q_i
in SQL, and a response R_i is represented as a relation (r_i).
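As a rough illustration of how the information contents can be derived from the responses, the sketch below represents a scenario as a list of <Q, R> pairs and collects the attributes of all response relations. The names QRPair, Response and warehouse_contents are our own hypothetical choices, and the two sample pairs anticipate the sales-counter example developed in the remainder of this section.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Response:
    attributes: Tuple[str, ...]                         # attributes of the relation
    tuples: List[tuple] = field(default_factory=list)   # rows; None stands for Null


@dataclass
class QRPair:
    query: str          # the query Q, e.g. as SQL text
    response: Response  # the response R


def warehouse_contents(scenario: List[QRPair]) -> List[str]:
    """Collect, in first-seen order, the attributes of all responses."""
    seen: List[str] = []
    for pair in scenario:
        for attr in pair.response.attributes:
            if attr not in seen:
                seen.append(attr)
    return seen


scenario = [
    QRPair("Q_1: units sold per product, sales counter and region",
           Response(("Regions", "Sales Counter", "Product", "Units Sold"))),
    QRPair("Q_2: units manufactured per product",
           Response(("Product", "No. of units manufactured"))),
]

print(warehouse_contents(scenario))
```

Only the attribute names are used here; the tuple values matter for the second use of a response, the formulation of supplementary queries.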

Once a response has been received, it can be used in two ways: (a) the relation
attributes identify the information type to be maintained in the warehouse, and
(b) the tuple values can suggest the formulation of supplementary queries to elicit
additional information. It is possible that all values in all tuples are non-null.
In that case there is full knowledge of the data, and a certain supplementary query
sequence follows from this. We refer to such a <Q, R> sequence as a normal
scenario (see Fig. 6). If a tuple contains a null value, this ‘normal’
sequence may not be followed, and the next query may be formulated to explore
the null value more deeply. This breaks the normal thought
process and results in a different sequence of <Q, R> pairs. We call this sequence
an exceptional scenario. Fig. 5 shows these two types of informational scenarios.

Let us illustrate the notion of normal and exceptional scenarios. Consider
the decision “Open new sales counter”. In order to make this decision, the
decision-maker would like to know the units sold for different products at the
various sales counters of each region. After all, a new sales counter makes sense
(a) in an area where units sold are so high that existing counters are overloaded,
or (b) in a region where units sold are very low and this could be merely due
to the absence of a sales outlet. So, the first query is formulated to reveal this
information:

Q_1: How many units of different products have been sold at various sales
counters in each region?

This query shows that Region, Sales counter, Product and Number of units sold
must be made available in the data warehouse.

Select regions, sales counter, product, units sold From sales, region

R_1: Let the response be as follows:



Regions   Sales Counter   Product   Units Sold
NR        Null            Radio     Null
NR        Null            TV        Null
SR        Lata            Radio     90
SR        Lata            TV        200
SR        Lata            Fridge    200
SR        Kanika          Radio     Null
SR        Kanika          TV        110
SR        Kanika          Fridge    110
ER        Rubina          Radio     80
ER        Rubina          TV        250
ER        Rubina          Fridge    230
CR        Null            Radio     Null
CR        Null            TV        Null


Suppose that the decision-maker is not interested in exploring ‘null’ for the
moment. Instead, he wishes to see if unsold stock exists in some large quantity.
If so, then the opening of a sales counter might help in clearing unsold stock.
So, the decision-maker may ask for the number of units manufactured. If the
manufactured quantity is not sold then he may think of opening a new sales counter
in a particular region. This query and its response are shown in Fig. 7a. This results
in the normal scenario <Q_1, R_1>, <Q_2, R_2> shown in Fig. 6.
Fig. 5. An Information Scenario.

Fig. 6. Normal and Exceptional Scenario.



Suppose that the decision-maker wishes to explore the ‘null’ values found for the
sales counters of regions. The reasoning followed is that if there are service centers
in the regions NR and CR which are, in fact, servicing a number of company
products then there is sufficient demand in these regions. This may call for
the opening of sales counters. This query and its response are shown in Fig. 7b.
The sequence <Q_3, R_3> and any further <Q, R> pairs following from this
constitute the exceptional scenario shown in Fig. 6.

In Fig. 7b, suppose that the response R_3 contains null values for service centers
in some region and that the decision-maker again wishes to explore these ‘null’
values. The reasoning followed is that if there are no sales counters
and no service centers in the region CR then, to take the decision ‘open new sales
counter’ in CR, he may ask for the number of sales counters of other companies
manufacturing the same products. This query and its response are shown in Fig.
7c. The sequence <Q_4, R_4> and any further <Q, R> pairs following from
this constitute another exceptional scenario. It also shows that an exceptional
scenario can lead to another exceptional scenario and so on.
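The branching between normal and exceptional scenarios can be sketched with an in-memory SQLite database loaded with the response R_1. This is only an illustration under our own assumptions: the single denormalized table sales, its column names, and the helper functions regions_without_counter and scenario_kind are hypothetical and are not prescribed by the approach.

```python
import sqlite3

# Response R_1 from the running example; None stands for Null.
R1 = [
    ("NR", None, "Radio", None), ("NR", None, "TV", None),
    ("SR", "Lata", "Radio", 90), ("SR", "Lata", "TV", 200),
    ("SR", "Lata", "Fridge", 200), ("SR", "Kanika", "Radio", None),
    ("SR", "Kanika", "TV", 110), ("SR", "Kanika", "Fridge", 110),
    ("ER", "Rubina", "Radio", 80), ("ER", "Rubina", "TV", 250),
    ("ER", "Rubina", "Fridge", 230),
    ("CR", None, "Radio", None), ("CR", None, "TV", None),
]

conn = sqlite3.connect(":memory:")
# Assumed schema: one denormalized table holding the tuples of R_1.
conn.execute(
    "CREATE TABLE sales (region TEXT, sales_counter TEXT, product TEXT, units_sold INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", R1)


def regions_without_counter(conn: sqlite3.Connection) -> list:
    """Regions whose sales counter is null: candidates for a query such as Q_3."""
    rows = conn.execute(
        "SELECT DISTINCT region FROM sales WHERE sales_counter IS NULL ORDER BY region"
    ).fetchall()
    return [row[0] for row in rows]


def scenario_kind(conn: sqlite3.Connection, explore_nulls: bool) -> str:
    """'exceptional' if nulls exist and the decision-maker explores them, else 'normal'."""
    null_count = conn.execute(
        "SELECT COUNT(*) FROM sales WHERE sales_counter IS NULL OR units_sold IS NULL"
    ).fetchone()[0]
    return "exceptional" if (null_count > 0 and explore_nulls) else "normal"


print(regions_without_counter(conn))
print(scenario_kind(conn, explore_nulls=False))
print(scenario_kind(conn, explore_nulls=True))
```

The flag explore_nulls stands for the decision-maker's choice: the presence of null values alone does not make the scenario exceptional, since the decision-maker may set them aside and continue with the normal sequence.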

3.1 Positioning of Informational Scenario 

Here we show that an informational scenario is a subclass of the class of scenarios
in the 4-dimensional scenario classification framework proposed in [17].
Q_2: Provide number of units manufactured, number of units sold and product.

Select units manufactured, product From Product

R_2: Let the response be as follows:

Product   No. of units manufactured
TV        700
Radio     700
Fridge    400

Fig. 7a



Q_3: Provide number of service centers, number of customers in the regions where
there are no sales counters.

Select count(servicecenter), count(customer)

R_3: Let the response be as follows:

Regions   No. of service Centers   No. of Customers
NR        4                        100
CR        0                        40

Fig. 7b



Q_4: Provide number of sales counters of different companies manufacturing the same
product in different regions.

Select count(sales counter), company, region
From manufacturing company

R_4: Let the response be as follows:

Company   Region   No. of sales counter
Philips   CR       4
Philips   NR       2
Samsung   CR       3
Samsung   NR       2

Fig. 7c



The 4-dimensional framework considers scenarios along four different views,
each view capturing a particular relevant aspect of the scenarios. The
Form view deals with the expression mode of a scenario. The Contents view
concerns the kind of knowledge which is expressed in a scenario. The Purpose view
is used to capture the role that a scenario is aiming to play in the requirements
engineering process. The Lifecycle view considers scenarios as artifacts
existing and evolving in time through the execution of operations during the
requirements engineering process.

A set of facets is associated with each view. Facets are considered as view- 
points suitable to characterize and classify a scenario according to this view. A 
facet has a metric attached to it. Each facet is measured by a set of relevant at- 
tributes. Table 1 shows the views, facets, attributes and possible values of these 
attributes in the 4-dimensional framework together with attribute values that 
our informational scenario takes on. Consider the level of formalism attribute of 
the Form view. This takes on the value Formal because of the use of SQL and 
relations in the scenario expression. It is possible to express a scenario less for- 
mally by using free format. Were such a scenario to exist, its level of Formalism 
would have the value Informal. 

The informational scenario proposed by us is also characterized according to these
four views.



3.2 Elicitation of Decisions 

In this section we show that informational scenarios can help in eliciting decisions 
as well. These decisions are suggested by an analysis of < Q, R > sequence of 
the scenario. 

Let us consider the decision “Open new sales counter” again. Let the
decision-maker make a query as follows:

Q_1: What are the units sold for different products at various sales counters
in each region?

R_1: Let the response be as follows:



Regions   Sales Counter   Product   Units Sold
SR        Lata            Radio     30
SR        Lata            TV        100
SR        Lata            Fridge    90
SR        Kanika          Radio     25
SR        Kanika          TV        90
SR        Kanika          Fridge    90
ER        Rubina          Radio     40
ER        Rubina          TV        120
ER        Rubina          Fridge    100



The response shows that the number of units sold for the different products
is very low. The decision-maker may therefore no longer be interested in continuing
with the decision “open new sales counter”. Since the number of units sold
is low, the decision-maker may now be interested in improving product sales. This
leads to the elicitation of a new decision, ‘Improve Product Sales’. An informational
scenario is now written out for this decision.
Table 1. Positioning of Informational Scenario in 4-Dimensional Framework.

View             Facet              Attribute                      Possible values                                    Informational Scenario value
Form View        1) Description     i) Level of Formalism          Informal, Semiformal, Formal                       Formal
                                    ii) Medium                     Text, Graphics, Images, Video, S/W Prototypes      Text
                 2) Presentation    i) Animation                   Boolean                                            False
                                    ii) Interactivity              None, Hypertext (refinement, exploration links,    Refinement
                                                                   traceability links), Advanced
Content View     1) Abstraction     i) Concrete                    Boolean                                            True
                                    ii) Abstract                   Boolean                                            False
                                    iii) Mixed                     Boolean                                            False
                 2) Context         i) System Internal             Boolean                                            False
                                    ii) System Interaction         Boolean                                            True
                                    iii) Organizational Context    Boolean                                            False
                                    iv) Organizational Environment Boolean                                            False
                 3) Argumentation   i) Positions                   Boolean                                            True
                                    ii) Arguments                  Boolean                                            False
                                    iii) Issues                    Boolean                                            False
                                    iv) Decisions                  Boolean                                            False
                 4) Coverage        i) Functional                  Set (Structure, Function, Behaviour)               { }
                                    ii) Non-Functional             Set (Performance, Time constraints, etc.)          { }
                                    iii) Intentional Aspect        Set (Goal, Problem, Cause, etc.)                   <Decision, Information>
Purpose View     1) Role            i) Descriptive                 Boolean                                            True
                                    ii) Exploratory                Boolean                                            False
                                    iii) Explanatory               Boolean                                            False
Life Cycle View  1) Life span       i) Life Span                   Transient, Persistent                              Persistent
                 2) Operation       i) Refinement                  Boolean                                            False
                                    ii) Integration                Boolean                                            False
                                    iii) Expansion                 Boolean                                            False
                                    iv) Delete                     Boolean                                            False
                                    v) Capture                     From_scratch, By_reuse                             From_scratch
Thus it is possible to move in both directions in the decision-information
coupling. An informational scenario is written for a given decision; it may
lead to the elicitation of other decisions which, in turn, lead to further
informational scenarios.

4 Conclusions 

Information Systems/Software Engineering moved from early ‘code and fix’
approaches through design to requirements engineering. Thus, considerable
exploration of the problem space is performed before implementation. We can see the
same evolution in DW engineering: as mentioned in the Introduction, attempts
have been made to introduce the design and conceptual layers. This evolution
has the same expectations as before, namely, developing systems that better fit
organisational needs and user requirements. Thus, we expect that today’s practice,
where analysts understand DW use after it has been implemented/used, will give
way to a systematic approach satisfying the various stakeholders. Analysts will
understand DW use partly through the argumentation and reasoning process of
requirements engineering and partly through the use of the prototyping process
model.

Just as traditional scenarios elicit the functional requirements of transactional
systems, informational scenarios elicit the informational requirements of
decisional systems. Both belong to the general class of scenarios and represent
typical interactions between the user and the system to be developed. In
traditional scenarios the interest is in functional interaction: if the user does this
then the system does that. In informational scenarios the interest is in obtaining
information and we have an information-seeking interaction: if I ask for this
information, what do I get?

Information may be missing or available. Depending upon this, the user may
formulate other information-seeking interactions. We have used this to classify
scenarios as exceptional or normal. Finally, it is possible that informational
scenarios may suggest new decisions, thus helping in decision elicitation.

We are working on framing guidelines for informational scenarios. Future
work also concerns decision elicitation by exploiting the goal-decision coupling.
Abstract. Data Warehouses (DW), Multidimensional (MD) Databases, and On-
Line Analytical Processing Applications are used as a very powerful mecha- 
nism for discovering crucial business information. Considering the extreme im- 
portance of the information managed by these kinds of applications, it is essen- 
tial to specify security measures from early stages of the DW design in the MD 
modeling process, and enforce them. In the past years, there have been some 
proposals for representing main MD modeling properties at the conceptual 
level. Nevertheless, none of these proposals considers security measures as an 
important element in their models, so they do not allow us to specify confiden- 
tiality constraints to be enforced by the applications that will use these MD 
models. In this paper, we discuss the confidentiality problems regarding DW’s 
and we present an extension of the Unified Modeling Language (UML) that al- 
lows us to specify main security aspects in the conceptual MD modeling, 
thereby allowing us to design secure DW’s. Then, we show the benefit of our 
approach by applying this extension to a case study. Finally, we also sketch 
how to implement the security aspects considered in our conceptual modeling 
approach in a commercial DBMS. 

Keywords: Secure data warehouses, UML extension, multidimensional model- 
ing, OCL 



1 Introduction 

Multidimensional (MD) modeling is the foundation of Data Warehouses (DW), MD 
Databases and On Line Analytical Processing Applications (OLAP). These systems 
are used as a very powerful mechanism for discovering crucial business information 
in strategic decision making processes. Considering the extreme importance of the 
information that a user can discover by using these kinds of applications, it is crucial 
to specify confidentiality measures in the MD modeling process, and enforce them. 

On the other hand, information security is a serious requirement which must be
carefully considered, not as an isolated aspect, but as an element present in all
stages of the development lifecycle, from requirement analysis to implementation
and maintenance [4, 6]. To achieve this goal, different ideas for integrating security in
the system development process have been proposed [2, 8], but they only consider
information security from a cryptographic point of view, without considering database
and DW specific issues.

There are some proposals that try to integrate security into conceptual modeling. 
UMLSec [9], where UML is extended to develop secure systems, is probably the most 
relevant one. This approach is very interesting, but it only deals with information
systems (IS) in general, whilst conceptual database and DW design are not consid- 
ered. A methodology and a set of models have recently been proposed [5] in order to 
design secure databases to be implemented with Oracle9i Label Security (OLS) [11]. 
This approach, based on the UML, is important because it considers security aspects 
in all stages of the development process, from requirement gathering to implementa- 
tion. Together with the previous methodology, the proposed Object Security Con- 
straint Language (OSCL) [14], based on the Object Constraint Language (OCL) [19] 
of UML, allows us to specify security constraints in the conceptual and logical data- 
base design process, and to implement these constraints in a concrete database man- 
agement system (DBMS) such as OLS. Nevertheless, the previous methodology and 
models do not consider the design of secure MD models for DW’s. 

In the literature, we can find several initiatives to include security in DW [15, 16].
Many of them are focused on interesting aspects related to access control, multilevel
security, its applications to federated databases, applications using commercial tools
and so on. These initiatives refer to specific aspects that allow us to improve DW
security in acquisition, storage, and access aspects. However, none of them considers
the security aspects comprising all stages of the system development cycle, nor
do they consider security in the MD conceptual modeling.

Regarding the conceptual modeling of DW’s, various approaches have been proposed to
represent main MD properties at the conceptual level (due to space constraints, we
refer the reader to [1] for a detailed comparison between the most relevant ones). These
proposals provide their own non-standard graphical notations, and none of them has
been widely accepted as a standard conceptual model for MD modeling. Recently,
another approach [12, 18] has been proposed as an object-oriented conceptual MD
modeling approach. This proposal is a profile of the UML [13], which uses its standard
extension mechanisms (stereotypes, tagged values and constraints). However,
none of these approaches considers security as an important issue in their conceptual
models, so they do not solve the problem of security in DW’s.

In this paper, we present an extension of the UML (profile) that allows us to represent
main security information of data and their constraints in the MD modeling at the
conceptual level. The proposed extension is based on the profile presented in [12] for
the conceptual MD modeling because it allows us to consider main MD modeling
properties and because it is based on the UML (so designers avoid learning a new specific
notation or language). We consider the multilevel security model [17], focusing
on aspects regarding read operations because this is the most common
operation for final user applications. This model allows us to classify both information
and users into security classes, and enforce the mandatory access control [17]. By
using this approach, we are able to implement secure MD models with any commercial
DBMS that is able to implement multilevel databases, such as OLS [11] or DB2
Universal Database (UDB) [3].

The remainder of this paper is structured as follows: Section 2 briefly summarizes
the conceptual approach for MD modeling on which our work is based. Section 3 proposes
the new UML extension for secure MD modeling. Section 4 presents a case study and
applies our UML extension for secure MD modeling. Section 5 sketches some further
implementation issues. Finally, Section 6 presents the main conclusions and introduces
our immediate future work.
2 Object-Oriented Multidimensional Modeling

In this section, we outline our approach, based on the UML [12, 18], for DW conceptual
modeling. This approach has been specified by means of a UML profile that
contains the necessary stereotypes to represent all main features of MD modeling at
the conceptual level [7]. In this approach, structural properties are specified by a
UML class diagram in which information is organized into facts and dimensions.

Facts and dimensions are represented by means of fact classes and dimension
classes respectively. Fact classes are defined as composite classes in shared aggregation
relationships of n dimension classes. The many-to-many relations between a fact
and a specific dimension are specified by means of the multiplicity 1..* on the role of
the corresponding dimension class. In our example in Fig. 1, we can see how the Sales
fact class has a many-to-many relationship with the Product dimension.

A fact is composed of measures or fact attributes. By default, all measures are
considered to be additive. For non-additive measures, additive rules are defined as
constraints. Moreover, derived measures can also be explicitly represented (by /) and
their derivation rules are placed between braces near the fact class. Our approach also
allows the definition of identifying attributes in the fact class (stereotype OID). In this
way degenerated dimensions can be considered [10], thereby representing other fact
features in addition to the measures for analysis. For example, we could store the
ticket number (ticket_number) as a degenerated dimension, as reflected in Fig. 1.

Regarding dimensions, each level of a classification hierarchy is specified by a
base class (stereotype Base). An association of base classes specifies the relationship
between two levels of a classification hierarchy. These classes must define a Directed
Acyclic Graph (DAG) rooted in the dimension class (DAG constraint). The DAG
structure can represent both multiple and alternative path hierarchies. Every base class
must also contain an identifying attribute (OID) and a descriptor attribute¹ (stereotype
D). These attributes are necessary for an automatic generation process into commercial
OLAP tools, as these tools store this information in their metadata.

Fig. 1. Multidimensional modeling using the UML

¹ A descriptor attribute will be used as the default label in the data analysis in OLAP tools.
We can also consider non-strict hierarchies (an object at a hierarchy’s lower level
belongs to more than one higher-level object) and complete hierarchies (all members 
belong to one higher-class object and that object consists of those members only). 
These characteristics are specified by means of the multiplicity of the roles of the 
associations and defining the constraint {completeness} in the target associated class 
role respectively. See Store dimension in Fig. 1 for an example of all kinds of classifi- 
cation hierarchies. Lastly, the categorization of dimensions is considered by means of 
the generalization / specialization relationships of UML. 
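The DAG constraint on classification hierarchies described in this section can be checked mechanically. The sketch below is only an illustration under our own assumptions: the function name is_rooted_dag and the level names (Store, City, SalesDistrict, Country) are hypothetical and are not taken from the profile or from Fig. 1. It uses a depth-first search to verify that the associations contain no cycle and that every level is reachable from the dimension class.

```python
from typing import Dict, List


def is_rooted_dag(edges: Dict[str, List[str]], root: str) -> bool:
    """True if the graph is acyclic and every node is reachable from root."""
    nodes = set(edges) | {t for targets in edges.values() for t in targets}
    if root not in nodes:
        return False

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {n: WHITE for n in nodes}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for nxt in edges.get(node, []):
            if colour[nxt] == GREY:
                return False  # back edge: the hierarchy contains a cycle
            if colour[nxt] == WHITE and not visit(nxt):
                return False
        colour[node] = BLACK
        return True

    # Acyclic from the root, and no level left unvisited.
    return visit(root) and all(c == BLACK for c in colour.values())


# Hypothetical alternative path hierarchy rooted in a dimension class.
hierarchy = {
    "Store": ["City", "SalesDistrict"],
    "City": ["Country"],
    "SalesDistrict": ["Country"],
    "Country": [],
}
print(is_rooted_dag(hierarchy, "Store"))

# A cyclic association between two levels violates the constraint.
print(is_rooted_dag({"Store": ["City"], "City": ["Store"]}, "Store"))
```

Two paths converging on the same level, as in the first example, are allowed; this is what makes alternative path hierarchies expressible.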

3 UML Extension for Secure Multidimensional Modeling 

The goal of this UML extension is to allow us to design MD conceptual models
while classifying the information in order to define which properties users must own to be
entitled to access the information. Therefore, we have to consider three main stages:

1. Defining precisely the organization of the users that will have access to the MD
system. We can define a precise level of granularity considering three ways of
organizing the users: Security hierarchy levels (which indicate the clearance level
of the user), user Compartments (which indicate a horizontal classification of
users), and user Roles (which indicate a hierarchical organization of users according
to their roles or responsibilities within the organization).

2. Classifying the information into the MD model. We can define the security
information for each element of the model (fact class, dimension class, etc.) by using a
tuple composed of a sequence of security levels, a set of user compartments, and a
set of user roles. We can also specify security constraints considering this security
information. This security information and constraints indicate the security
properties that users must own to be able to access the information.

3. Enforcing the mandatory access control (AC). The typical operations executed by
final users in this type of system are query operations. So, the mandatory access
control has to be enforced for the read operations, whose access control rule is as
follows: A user can access a piece of information only if a) the security level of the
user is greater than or equal to the security level of the information, b) all the user
compartments that have been defined for the information are owned by the user,
and c) at least one of the user roles defined for the information is played by the
user.
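The read access rule of the third stage can be sketched as a simple predicate. The code below is only an illustration under our own assumptions: the names SecurityInfo and can_read are hypothetical, security levels are encoded as integers, the information carries a single level rather than a sequence of levels, the role hierarchy is ignored, and an empty set of roles on the information is treated as "no role restriction".

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class SecurityInfo:
    level: int                                   # assumption: higher means more restricted
    compartments: FrozenSet[str] = frozenset()
    roles: FrozenSet[str] = frozenset()


def can_read(user: SecurityInfo, info: SecurityInfo) -> bool:
    """Apply conditions a), b) and c) of the read access control rule."""
    level_ok = user.level >= info.level                         # a) clearance dominates
    compartments_ok = info.compartments <= user.compartments    # b) all compartments owned
    # c) at least one role played; assumption: no roles defined means no restriction
    roles_ok = not info.roles or bool(info.roles & user.roles)
    return level_ok and compartments_ok and roles_ok


# Hypothetical user and model elements.
analyst = SecurityInfo(2, frozenset({"sales"}), frozenset({"analyst"}))
sales_fact = SecurityInfo(1, frozenset({"sales"}), frozenset({"analyst", "manager"}))
restricted_fact = SecurityInfo(3, frozenset({"sales"}), frozenset({"analyst"}))
hr_dimension = SecurityInfo(1, frozenset({"sales", "hr"}), frozenset({"analyst"}))

print(can_read(analyst, sales_fact))       # all three conditions hold
print(can_read(analyst, restricted_fact))  # fails condition a)
print(can_read(analyst, hr_dimension))     # fails condition b)
```

The three conditions are conjunctive, so failing any one of them is enough to deny the read operation.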

In this paper, we will only focus on the second stage by defining a UML extension
that allows us to classify the security elements in a conceptual MD model and to
specify security constraints. Furthermore, in Section 5, we sketch how to
deal with the third stage by generating the needed structures in the target DBMS to
consider all security aspects represented in the conceptual MD model. Finally, let us
point out that the first stage is concerned with security policies defined in the organization
by managers, and it is out of the scope of this paper.

We define our UML extension for secure conceptual MD modeling following the
schema composed of these elements: description, prerequisite extensions,
stereotypes/tagged values, well-formedness rules, and comments. For the definition of the
stereotypes, we consider a structure that is composed of a name, the base metaclass,
the description, the tagged values and a list of constraints defined by means of OCL.
For the definition of tagged values, the type of the tagged values, the multiplicity, the
description, and the default value are defined.
3.1 Description

This UML extension reuses a set of stereotypes previously defined in [12], and
defines new tagged values, stereotypes, and constraints, which enable us to define
secure MD models. The 20 tagged values we have defined are applied to certain
components that are specific to MD modeling, allowing us to represent them in
the same model and in the same diagrams that describe the rest of the system. These
tagged values will represent the sensitive information for the different elements of the
MD modeling (fact class, dimension class, etc.), and they will allow us to specify
security constraints depending on this security information and on the value of certain
attributes of the model. The stereotypes will help us identify a special class that will
define the profile of the system users. A set of inherent constraints is specified in
order to define well-formedness rules. The correct use of our extension is assured by
the definition of constraints in both natural language and OCL [19].
Fig. 2. Extension of the UML with stereotypes



Thus, we have defined 7 new stereotypes: one specializes the Class model element,
two specialize the Primitive model element and four specialize the Enumeration
model element. In Fig. 2, we have represented portions of the UML metamodel²
to show where our stereotypes fit. We have only represented the specialization
hierarchies, as the most important fact about a stereotype is the base class that the
stereotype specializes. In this figure, new stereotypes are colored in dark grey,
whereas stereotypes we reuse from our previous profile [27] are in light grey and
classes from the UML metamodel remain white.

3.2 Prerequisite Extensions 

This UML profile reuses stereotypes previously defined in another UML profile [12]. 
That profile provides the stereotypes, tagged values, and constraints needed to accomplish 



2 All the metaclasses come from the Core Package, a subpackage of the Foundation Package. 
We based our extension on the UML 1.5 as this is the current accepted standard. To the best 
of our knowledge, the current UML 2.0 is not the final accepted standard yet. 
MD modeling properly, allowing us to represent the main MD properties at the 
conceptual level. To facilitate the comprehension of the UML profile we present and use 
in this paper, we provide a brief description of these stereotypes in Table 1. 



Table 1. Stereotypes from the UML profile for conceptual MD modeling [12]. 

| Name | Base Class | Description |
|---|---|---|
| Fact | Class | Classes of this stereotype represent facts in a MD model |
| Dimension | Class | Classes of this stereotype represent dimensions in a MD model |
| Base | Class | Classes of this stereotype represent dimension hierarchy levels in a MD model |
| OID | Attribute | Attributes of this stereotype represent OID attributes of Fact, Dimension or Base classes in a MD model |
| FactAttribute | Attribute | Attributes of this stereotype represent attributes of Fact classes in a MD model |
| Descriptor | Attribute | Attributes of this stereotype represent descriptor attributes of Dimension or Base classes in a MD model |
| DimensionAttribute | Attribute | Attributes of this stereotype represent attributes of Dimension or Base classes in a MD model |
| Completeness | Association | Associations of this stereotype represent the completeness of an association between a Dimension class and a Base class or between two Base classes |



3.3 Datatypes 

First of all, we need to define some new data types to be used in our tagged 
value definitions. The type Level (Fig. 3 (a)) is the ordered enumeration composed 
of all security levels that have been considered (these values are typically 
unclassified, confidential, secret and top secret, but they could be different). The type 
Levels (Fig. 3 (b)) is an interval of levels composed of a lower level and an 
upper level. The type Role (Fig. 3 (c)) represents the hierarchy of user roles that 
can be defined for the organization. The type Roles is a set of role trees or subtrees. 
The type Compartment (Fig. 3 (d)) is the enumeration composed of all user 
compartments that have been considered for the organization. The type Compartments is a 
set of user compartments. The type Privilege (Fig. 3 (e)) is an ordered enumeration 
composed of all the different privileges that have been considered (these values 
are typically read, insert, delete, update, and all). The type Attempt (Fig. 3 (f)) is 
an ordered enumeration composed of all the different access attempts that have been 
considered (these values are typically none, all, frustratedAttempt, sucessfullAccess, but 
they could be different). 
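
As an illustration only, the Level and Levels types could be sketched in Python as follows. This is a minimal sketch, not part of the profile: the class names mirror the types above, while the member names (UNCLASSIFIED, etc.) and the helper methods contains and within are assumptions introduced here.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Ordered security levels, from least to most restrictive (assumed values)."""
    UNCLASSIFIED = 0
    CONFIDENTIAL = 1
    SECRET = 2
    TOP_SECRET = 3


@dataclass(frozen=True)
class Levels:
    """Interval of security levels made of a lower and an upper level."""
    lower: Level
    upper: Level

    def __post_init__(self):
        # An interval whose lower bound exceeds its upper bound is meaningless.
        if self.lower > self.upper:
            raise ValueError("lower level must not exceed upper level")

    def contains(self, level: Level) -> bool:
        """True if `level` lies inside this interval (bounds included)."""
        return self.lower <= level <= self.upper

    def within(self, other: "Levels") -> bool:
        """True if this interval lies entirely inside `other`."""
        return other.lower <= self.lower and self.upper <= other.upper
```

Using an ordered enumeration makes comparisons such as "more restrictive than" plain integer comparisons, which is what the well-formedness rules later rely on.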

Fig. 3. New Data types 



In Fig. 2 we can see the base classes from which these new stereotypes are specialized. 
All the information contained in these new stereotypes has to be defined for each 
MD model depending on its confidentiality properties, and on the number of users 
and the complexity of the organization in which the MD model will be operative. Finally, 
we need some syntactic definitions that are not considered in standard OCL. In 
particular, we need the new collection type Tree with its typical operations. 
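
To make the role hierarchy concrete, the following Python sketch shows one possible reading of a Tree type with two typical operations (finding a node and testing for a subtree). The class RoleNode, its methods, and the node names Health and NonHealth are assumptions made for illustration; the hierarchy itself follows the case study in Section 4.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class RoleNode:
    """A node of the user role tree; children are more specific roles."""
    name: str
    children: List["RoleNode"] = field(default_factory=list)

    def find(self, name: str) -> Optional["RoleNode"]:
        """Depth-first search for the node called `name`."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def names(self) -> Set[str]:
        """All role names in the subtree rooted at this node."""
        result = {self.name}
        for child in self.children:
            result |= child.names()
        return result

    def has_subtree(self, root_name: str) -> bool:
        """True if a subtree rooted at `root_name` exists below this node."""
        return self.find(root_name) is not None


# Role hierarchy of the health-care case study (node names are assumed spellings).
roles = RoleNode("HospitalEmployees", [
    RoleNode("Health", [RoleNode("Doctors"), RoleNode("Nurses")]),
    RoleNode("NonHealth", [RoleNode("Maintenance"), RoleNode("Administrative")]),
])
```

A class tagged with the role Health would then implicitly cover Doctors and Nurses, since both belong to the subtree rooted at Health.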

3.4 Tagged Values 

In this section, we provide the definition of several tagged values for the model, 
classes, attributes, instances and constraints. 



Table 2. Stereotypes of the new data types. 

Tagged Values of the Model 

| Name | Type | M (a) | Description |
|---|---|---|---|
| classes | Set(OclType) | 1..* | It specifies all classes of the model. This new tagged value is useful in order to navigate through all classes of the model |
| securityLevels | Sequence(Levels) | 1..* | It specifies all security levels (ordered from less to more restrictive) that can be used by the model elements |
| securityRoles | Role | 0..* | It specifies the hierarchical role structure that has been defined for the organization. This type will be managed as a tree |
| securityCompartments | Set(Compartment) | 0..* | It specifies the set of compartments that have been defined for the organization |

Tagged Values of the Class 

| Name | Type | M | Description |
|---|---|---|---|
| SecurityLevels | Levels | 1..* | It specifies the interval of possible security level values that an instance of this class can receive. If the upper and lower security levels are the same, all instances will have the same security level. Otherwise, the concrete instance security level will be defined according to a security constraint |
| SecurityRoles | Set(Role) | 0..* | It specifies a set of user roles. Each role is the root of a subtree of the general user role hierarchy defined for the organization. All instances of this class can have the same user roles, or maybe subtrees of the roles that have been defined for the class. A security constraint can decide the user roles for each instance according to the value of some attributes of the instance |
| SecurityCompartments | Set(Compartment) | 0..* | It specifies a set of compartments. All instances of this class can have the same user compartments, or a subset of them. A security constraint can decide the user compartments for each instance according to the value of some attribute of the instance |
| LogType | Attempt | 0..1 | It specifies whether the access has to be recorded: none, all accesses, only frustrated accesses, or only successful accesses |
| LogCond | OCLExpression | 0..1 | It specifies whether the access has to be recorded |
| InvolvedClasses | Set(OclType) | 1..* | It specifies the classes that have to be involved in a query to be enforced in an exception |
| ExceptSign | {+,-} | 0..1 | It specifies whether an exception permits (+) or denies (-) access to instances of this class to a user or a group of users |
| ExceptPrivilege | Set(Privilege) | 1..* | It specifies the privileges that are granted to or removed from the user |
| ExceptCond | OCLExpression | 0..* | It specifies the condition that users have to fulfill to be affected by this exception |

Tagged Values of the Attribute 

| Name | Type | M | Description |
|---|---|---|---|
| SecurityLevels | Levels | 1..* | Due to space constraints, we do not include the descriptions of the tagged values of attributes, as they are similar to their counterpart tagged values of classes |
| SecurityRoles | Set(Role) | 0..* | |
| SecurityCompartments | Set(Compartment) | 0..* | |

Tagged Values of the Instance 

| Name | Type | M (a) | Description |
|---|---|---|---|
| SecurityLevel | Level | 1..* | It specifies the security level of an instance |
| SecurityRoles | Set(Role) | 0..* | It specifies a set of user roles for this instance. Each role is a subtree of the user role hierarchy defined for the organization |
| SecurityCompartments | Set(Compartment) | 0..* | It specifies the set of compartments for an instance |

Tagged Values of the Constraint 

| Name | Type | M | Description |
|---|---|---|---|
| InvolvedClasses | Set(OCLType) | 0..1 | It specifies the classes that are involved in a query, to be enforced in the constraint |

(a) M stands for Multiplicity 



Table 2 shows the tagged values of all elements in this extension. The default 
values of all security tagged values of the model are empty collections. On the other hand, 
the default value of the security tagged values of each class is the least restrictive one 
(the lowest security level, the security role hierarchy that has been defined for the model, 
and the empty set of compartments). The default value of the security tagged values 
of attributes is inherited from the class to which they belong. 

If we need to specify the situations in which accesses to the information of a class 
have to be recorded in a log file for future audit, we should use the LogType and LogCond 
tagged values together in that class. By default, the value of LogType is none, so no 
audit is performed by default. On the other hand, if we need to specify a security 
constraint, we can use OCL and the InvolvedClasses tagged value to specify in which 
situation the constraint has to be enforced. By default, the value of this tagged value is 
the class to which the constraint is associated. Finally, if we need to specify a special 
security constraint under which one or more users (depending on a condition) can or cannot 
access the corresponding class, independently of the security information of that class, we 
should use exceptions together with the following tagged values: InvolvedClasses, 
ExceptSign, ExceptPrivilege and ExceptCond. The default value of InvolvedClasses is 
the class itself. The default value of ExceptSign is +, and that of ExceptPrivilege is Read. 
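
The interplay of LogType and LogCond can be sketched as a small decision function. The sketch below is hypothetical: the text does not state how the two tagged values combine, so we assume that LogCond, when present, must hold in addition to the rule selected by LogType. The names Attempt, must_log, and the context dictionary are illustrative only.

```python
from enum import Enum, auto
from typing import Callable, Optional


class Attempt(Enum):
    """Kinds of access attempts that may be audited."""
    NONE = auto()
    ALL = auto()
    FRUSTRATED = auto()   # access denied by the system
    SUCCESSFUL = auto()   # access granted by the system


def must_log(log_type: Attempt,
             access_granted: bool,
             log_cond: Optional[Callable[[dict], bool]] = None,
             context: Optional[dict] = None) -> bool:
    """Decide whether an access attempt has to be recorded for audit.

    Assumption: an optional LogCond predicate must also be satisfied.
    """
    if log_cond is not None and not log_cond(context or {}):
        return False
    if log_type is Attempt.ALL:
        return True
    if log_type is Attempt.FRUSTRATED:
        return not access_granted
    if log_type is Attempt.SUCCESSFUL:
        return access_granted
    return False  # Attempt.NONE, the default: nothing is audited
```

With the default LogType of none, must_log always returns False, which matches the statement that no audit is performed by default.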



3.5 Stereotypes 

By using all these tagged values, we can specify security constraints on a MD model 
depending on the values of attributes and tagged values. In this extension, we need to 
define one stereotype in order to specify other types of security constraints (Table 3). 
The stereotype UserProfile may be necessary to specify constraints that depend on 
particular information about a user or a group of users, e.g., citizenship, 
age, etc. The previously defined data types and tagged values are then used on 
the fact, dimension and base stereotypes in order to consider other security aspects. 



3.6 Well-Formedness Rules 

We can identify some well-formedness rules and specify them in both natural language 
and OCL constraints. These rules are grouped in Table 4. 

Table 3. Stereotype UserProfile of our extension. 

Name: UserProfile 

Base class: Class 

Description: Classes of this stereotype contain all the properties that the system manages about users 

Constraints: 

- This class has no associations to other classes 
    self.associationEnd->size() = 0 

- There is no more than one class of this type 
    context Model 
    inv self.classes->select(c | c.oclIsTypeOf(UserProfile))->size() <= 1 

- The name of a class of this stereotype will be UserProfile 
    self.className = 'UserProfile' 

Tagged Values: None 

Icon: (image not reproduced) 



Table 4. Well-Formedness constraints. 



- Correct value of the tagged values: 

The security levels defined for each class of the model (fact, dimension, and base classes) and for each 
attribute of each class (OID, FactAttribute, Descriptor, and DimensionAttribute) have to belong to the 
sequence of security levels that has been defined for the model. 

context Model 
inv self.classes->forAll(c | self.securityLevels->includesAll( 
    subSequence(c.securityLevels.lowerLevel, c.securityLevels.upperLevel)))  -- see note (a) 
inv self.classes->forAll(c | c.attributes->forAll(a | self.securityLevels->includesAll( 
    subSequence(a.securityLevels.lowerLevel, a.securityLevels.upperLevel)))) 

The set of user roles defined for each class and attribute of the model has to be a subtree of the roles tree 
that has been defined for the model. 

context Model 
inv self.classes->forAll(c | c.Roles->forAll(r | self.Role->includesAll(r))) 
inv self.classes->forAll(c | c.attributes->forAll(a | a.Roles->forAll(r | self.Role->includesAll(r)))) 

The set of user compartments defined for each class and attribute of the model has to be a subset of the 
compartments that have been defined for the model. 

context Model 
inv self.classes->forAll(c | c.Compartments->forAll(comp | self.Compartments->includes(comp))) 
inv self.classes->forAll(c | c.attributes->forAll(a | a.Compartments->forAll(comp | 
    self.Compartments->includes(comp)))) 


- The security information of instances: 

The security level of an instance of a class has to be included in the range of security levels that has 
been defined for the class. The same rule is applicable to the instances of attributes. 

context Model 
inv self.classes->forAll(c | c.allInstances->forAll(i | self.securityLevels-> 
    subSequence(c.securityLevels.lowerLevel, c.securityLevels.upperLevel)->includes(i.securityLevel))) 

The user roles of an instance of a class have to be subtrees of the role trees that have been defined for 
the class. The same rule is applicable to the instances of attributes. 

context Model 
inv self.classes->forAll(c | c.allInstances->forAll(i | c.securityRoles->includesAll(i.securityRoles))) 

The user compartments of an instance of a class have to be a subset of the compartments that have been 
defined for the class. The same rule is applicable to the instances of attributes. 

context Model 
inv self.classes->forAll(c | c.allInstances->forAll(i | 
    c.securityCompartments->includesAll(i.securityCompartments))) 


- Relationship between the security information of classes and their attributes: 

The security levels defined for an attribute have to be equal to or more restrictive than the security levels 
defined for its class. The same rule is applicable to the role hierarchies and user compartments. 

context Model 
inv self.classes->forAll(c | c.attributes->forAll(a | self.securityLevels-> 
    subSequence(c.securityLevels.lowerLevel, a.securityLevels.upperLevel)->includesAll( 
    self.securityLevels->subSequence(a.securityLevels.lowerLevel, a.securityLevels.upperLevel)))) 
inv self.classes->forAll(c | c.attributes->forAll(a | c.securityRoles->includesAll(a.securityRoles))) 
inv self.classes->forAll(c | c.attributes->forAll(a | 
    c.securityCompartments->includesAll(a.securityCompartments))) 



(a) The type of the arguments of the subSequence operation is Integer, but for the sake of readability, we 
allow the arguments to be elements of the sequence. The correct expression would be 
subSequence(self.securityLevels->indexOf(c.securityLevels.lowerLevel), 
self.securityLevels->indexOf(c.securityLevels.upperLevel)). We apply this simplification in all uses of the 
subSequence operation. 



- Categorization of dimensions: 

When a dimension class is specialized into several base classes, the security levels of the subclasses have 
to be equal to or more restrictive than the security levels of the superclass. The same rule is applicable to 
role hierarchies and user compartments. 

context Model 
inv self.classes->forAll(c | c.subClasses->forAll(s | self.securityLevels-> 
    subSequence(c.securityLevels.lowerLevel, s.securityLevels.upperLevel)->includesAll( 
    self.securityLevels->subSequence(s.securityLevels.lowerLevel, s.securityLevels.upperLevel)))) 
inv self.classes->forAll(c | c.subClasses->forAll(s | c.securityRoles->includesAll(s.securityRoles))) 
inv self.classes->forAll(c | c.subClasses->forAll(s | 
    c.securityCompartments->includesAll(s.securityCompartments))) 


- Classification hierarchies. As a general rule, we can consider that the more specific the information is, 
the more restricted its access is: 

If the class A has a 1..* association with the class B, this means that information of A groups information 
of B, so B is more specific than A. The security level defined for the class B has to be more restrictive than 
the security level defined for the class A. This rule is also applicable to user roles and compartments. 

context Model 
inv self.classes->forAll(c | c.associationEnd->forAll(a | c.a.multiplicity > 1 implies self.securityLevels-> 
    subSequence(c.securityLevels.lowerLevel, a.securityLevels.upperLevel)->includesAll( 
    self.securityLevels->subSequence(a.securityLevels.lowerLevel, a.securityLevels.upperLevel)))) 
inv self.classes->forAll(c | c.associationEnd->forAll(a | c.a.multiplicity > 1 implies 
    c.securityRoles->includesAll(a.securityRoles))) 
inv self.classes->forAll(c | c.associationEnd->forAll(a | c.a.multiplicity > 1 implies 
    c.securityCompartments->includesAll(a.securityCompartments))) 

If the class A has a *..* association with the class B, the designer has to decide which class contains the 
most specific information. This well-formedness rule cannot be specified because it depends on design 
decisions. 


- Derived attributes: 

The security levels of a derived attribute have to be equal to or more restrictive than those of the 
attributes on which it is based. The same rule is applicable to user roles and compartments. By default, a 
derived attribute inherits the security information of the attributes on which it is based. 

context Model 
inv self.classes->forAll(c | c.attributes->forAll(a | a.derived implies a.derivedFrom->forAll(d | 
    self.securityLevels->subSequence(a.securityLevels.lowerLevel, d.securityLevels.upperLevel)->includesAll( 
    self.securityLevels->subSequence(d.securityLevels.lowerLevel, d.securityLevels.upperLevel))))) 
inv self.classes->forAll(c | c.attributes->forAll(a | a.derived implies a.derivedFrom->forAll(d | 
    d.securityRoles->includesAll(a.securityRoles)))) 
inv self.classes->forAll(c | c.attributes->forAll(a | a.derived implies a.derivedFrom->forAll(d | 
    d.securityCompartments->includesAll(a.securityCompartments)))) 


- Combination of dimensions: 

A query on the fact class has to consider the security information that has been defined for that class. 

A query that involves the combination of a dimension class (or maybe a base class) and a fact class has 
to consider the combination of the security information of the dimension (or base) class and of the fact 
class. The security levels of the combination will be the most restrictive of the security levels of the 
dimension (or base) class and the fact class. The same rule is applicable to user roles and compartments. 

A query that involves the combination of several dimension classes and the fact class has to consider 
the combination of the security information of all classes. The security levels of the combination will be 
the most restrictive of the security levels of all classes. The same rule is applicable to user roles and 
compartments. 
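
Two of the rules in Table 4 can be illustrated with a short Python sketch: the index-based subSequence simplification of note (a), and the "most restrictive level" rule for combined queries. The function names and the representation of an interval as a (lowerLevel, upperLevel) pair are assumptions made here; the combination rule is shown for single levels only, since the text does not detail how intervals are combined.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[str, str]  # (lowerLevel, upperLevel), an assumed encoding


def sub_sequence(levels: Sequence[str], lower: str, upper: str) -> List[str]:
    """Levels between `lower` and `upper` (inclusive), located by index
    as in the corrected OCL expression that uses indexOf."""
    i, j = levels.index(lower), levels.index(upper)
    return list(levels[i:j + 1])


def attribute_respects_class(levels: Sequence[str],
                             cls: Interval, attr: Interval) -> bool:
    """Attribute levels must be equal to or more restrictive than the class
    levels: subSequence(c.lower, a.upper) includes subSequence(a.lower, a.upper)."""
    allowed = sub_sequence(levels, cls[0], attr[1])
    wanted = sub_sequence(levels, attr[0], attr[1])
    return all(level in allowed for level in wanted)


def most_restrictive(levels: Sequence[str], *candidates: str) -> str:
    """Level applied to a query that combines several classes."""
    return max(candidates, key=levels.index)


model_levels = ["unclassified", "confidential", "secret", "topSecret"]
```

For example, with model_levels as defined, an attribute restricted to topSecret respects a class whose interval is secret..topSecret, whereas an attribute whose interval starts at confidential does not.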



3.7 Comments 

Many of the previous constraints are very intuitive, but we have to ensure that they 
are fulfilled; otherwise the system can become inconsistent. Moreover, the designer can 
specify security constraints with OCL. If the security information of a class or an 
attribute depends on the value of an instance attribute, it can be specified as an OCL 
expression (Fig. 4). Normally, security constraints defined for stereotypes of classes 
(fact, dimension and base) are defined by using a UML note attached to the 
corresponding class instance. To allow the designer the greatest flexibility, we do not 
impose any restriction on the content of these notes other than those imposed by the 
tagged value definitions. The connection between a note and the element it applies to 
is shown by a dashed line without an arrowhead, as this is not a dependency [13]. 

4 A Case Study Applying Our Extension for Secure MD Modeling 

In this section, we apply our extension to the conceptual design of a secure DW in the 
context of a reduced health-care system. The simplified hierarchy of the system user 
roles is as follows: HospitalEmployees are classified into health and non-health users; 
health users can be Doctors or Nurses, and non-health users can be Maintenance or 
Administrative. The defined security levels are unclassified, secret and topSecret. 

1. Fig. 4 shows an MD model that includes a fact class (Admission), two dimensions 
(Diagnosis and Patient), two base classes (Diagnosis_group and City), and a class 
(UserProfile). The UserProfile class (stereotype UserProfile) contains the information 
of all users who will have access to this MD model. The Admission fact class 
(stereotype Fact) contains all individual admissions of patients in one or more 
hospitals, and can be accessed by all users who have secret or top secret security 
levels (tagged value SecurityLevels (SL) of classes) and play health or administrative 
roles (tagged value SecurityRoles (SR) of classes). Note that the cost attribute can 
only be accessed by users who play the administrative role (tagged value SR of 
attributes). The Patient dimension contains the information of hospital patients, and can be 
accessed by all users who have the secret security level (tagged value SL) and play 
health or administrative roles (tagged value SR). The Address attribute can only 
be accessed by users who play the administrative role (tagged value SR of attributes). 
The City base class contains the information of cities, and it allows us to group patients 
by cities. Cities can be accessed by all users who have the confidential security level 
(tagged value SL). The Diagnosis dimension contains the information of each diagnosis, 
and can be accessed by users who play the health role (tagged value SR) and have 
the secret security level (tagged value SL). Finally, Diagnosis_group contains a set of 
general groups of diagnoses. Diagnosis groups can be accessed by all users who 
have the confidential security level (tagged value SL). 

Several security constraints have been specified by using the previously defined 
constraints, stereotypes and tagged values (the number of each numbered paragraph 
corresponds to the number of each note in Fig. 4): 

2. The security level of each instance of Admission is defined by a security constraint 
specified in the model. If the value of the description attribute of the 
Diagnosis_group to which the diagnosis related to the Admission belongs is cancer 
or AIDS, the security level (tagged value SL) of this admission will be top secret, 
otherwise secret. This constraint is only applied if the user makes a query whose 
information comes from the Diagnosis dimension or the Diagnosis_group base 
class together with the Patient dimension (tagged value involvedClasses). 

3. The security level -tagged value SL- of each instance of Admission can also de- 
pend on the value of the cost attribute, which indicates the price of the admission 
service. In this case, the constraint is only applicable for queries that contain in- 
formation of the Patient dimension -tagged value involvedClasses-. 

4. The tagged value logType has been defined for the Admission class, specifying the 
value frustratedAttempts. This tagged value specifies that the system has to record, 
for future audit, the situations in which a user tries to access information of this 
fact class and the system denies it because of a lack of permissions. 

5. For confidentiality reasons, we could deny access to admission information to 
users whose working area is different from the area of a particular admission 
instance. This is specified by another exception in the Admission fact class, considering 
the tagged values involvedClasses, exceptSign and exceptCond. 

If patients are special users of the system, they could access their own 
information as patients (e.g., for querying their personal data). This constraint is specified by 
using the exceptSign and exceptCond tagged values in the Patient class. 
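
The behavior of security constraint 2 can be paraphrased in Python as follows. This is only an illustrative sketch: the function names, the string encoding of levels and class names, the literal "AIDS", and the fallback to secret when the constraint does not apply are all assumptions, not part of the model in Fig. 4.

```python
from typing import Iterable

# Diagnosis group descriptions that raise the security level (assumed literals).
SENSITIVE_GROUPS = {"cancer", "AIDS"}


def admission_level(group_description: str) -> str:
    """Security level of an Admission instance according to constraint 2."""
    return "topSecret" if group_description in SENSITIVE_GROUPS else "secret"


def constraint_applies(query_classes: Iterable[str]) -> bool:
    """involvedClasses: the query must combine Patient with Diagnosis
    or Diagnosis_group for the constraint to be enforced."""
    classes = set(query_classes)
    return "Patient" in classes and bool(classes & {"Diagnosis", "Diagnosis_group"})


def effective_level(group_description: str,
                    query_classes: Iterable[str],
                    default: str = "secret") -> str:
    """Level used for a query; `default` is an assumed fallback (the lower
    bound of the Admission interval) when the constraint is not enforced."""
    if constraint_applies(query_classes):
        return admission_level(group_description)
    return default
```

A query that joins Admission with Patient and Diagnosis for a cancer diagnosis would thus be evaluated at the top secret level, while the same admission queried without patient data would not trigger the constraint.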




Fig. 4. Example of multidimensional model with security information and constraints 3 



5 Implementation 

Oracle9i Label Security (OLS) [11] allows us to implement multilevel databases. It defines 
labels that are assigned to the rows and users of a database; these labels contain 
confidentiality information for rows and authorization information for users. Moreover, 
OLS allows us to specify labeling functions and predicates that are triggered when an 
operation is executed, and which define the value of security labels. 

A secure MD model can be implemented by OLS. The two main security elements 
that we include in this UML extension are confidentiality information of data, and 
security constraints. The basic concepts of a MD model (facts, dimension and base 
classes) are implemented as tables in a relational database. The security information 



3 Version 2 of OCL considers a special syntax for enumerations (EnumTypeName::EnumLiteralValue), 
but in this example, for the sake of readability, we consider only EnumLiteralValue. 
of the MD model can be implemented by the security labels of OLS, and the security 
constraints can be implemented by labeling functions and predicates of OLS. 

For instance, we could consider the table Admission with CodeAdmission, Type, 
Cost, CodeDiagnosis and PatientSSN columns. This table will have a special column 
to store the security label of each instance. For each instance, this label will contain 
the security information that has been specified in Fig. 4 (SecurityLevel=Secret..TopSecret; 
SecurityRoles=Health, Admin). But this security information depends on 
several security constraints that can be implemented by labeling functions. Table 5 (1) 
shows an example in which we implement the security constraints labeled with 
number 2 in Fig. 4. If the value of the Cost column is greater than 10000, then the security 
label will be composed of the TopSecret security level and the Health and Admin user roles; 
otherwise the security label will be composed of the Secret security level and the same user 
roles. Table 5 (2) shows how to link this labeling function with the Admission table. 



Table 5. Example of labeling function in OLS. 



(1) CREATE FUNCTION Which_Cost (Cost IN INTEGER) RETURN LBACSYS.LBAC_LABEL 
    AS MyLabel VARCHAR2(80); 
    BEGIN 
      -- Choose the label text according to the cost of the admission 
      IF Cost > 10000 THEN MyLabel := 'TS::Health,Admin'; ELSE MyLabel := 'S::Health,Admin'; END IF; 
      -- Convert the label text into a data label of the policy 
      RETURN TO_LBAC_DATA_LABEL('MyPolicy', MyLabel); 
    END; 

(2) APPLY_TABLE_POLICY ('MyPolicy', 'Admission', 'Scheme',, 'Which_Cost') 
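
For readers unfamiliar with OLS, the decision taken by the labeling function of Table 5 (1) can be mirrored in plain Python. This is a sketch of the logic only and does not call any Oracle API; the function name which_cost and the exact spacing of the label string are assumptions, with the label text following the layout used in Table 5.

```python
def which_cost(cost: int) -> str:
    """Return the label text chosen by the labeling function of Table 5 (1).

    The string follows the layout used in Table 5: a level (TS or S),
    an empty middle component, and the Health and Admin user roles.
    """
    level = "TS" if cost > 10000 else "S"
    return f"{level}::Health,Admin"
```

The threshold is strict: an admission costing exactly 10000 keeps the Secret label, and only higher costs are labeled TopSecret.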



6 Conclusions and Future Work 

In this paper, we have presented an extension of the UML that allows us to represent 
the main security aspects in the conceptual modeling of Data Warehouses. This extension 
contains the stereotypes, tagged values and constraints needed for complete and 
powerful secure MD modeling. These new elements allow us to specify security 
aspects such as security levels on data, compartments and user roles on the main 
elements of MD modeling, such as facts, dimensions and classification hierarchies. We 
have used OCL to specify the constraints attached to these newly defined elements, 
thereby avoiding an arbitrary use of them. We have also sketched how to implement a 
secure MD model designed with our approach in a commercial DBMS. The main 
advantage of this approach is that it uses the UML, a widely accepted 
object-oriented modeling language, which saves developers from learning a new model and 
its corresponding notations for specific MD modeling. Furthermore, the UML allows 
us to represent some MD properties that are hardly considered by other conceptual 
MD proposals. 

Our immediate future work is to extend the implementation issues presented in this 
paper to allow us to use the considered security aspects when querying a MD model 
from OLAP tools. Moreover, we also plan to extend the set of privileges considered 
in this paper to allow us to specify security aspects in the ETL processes for DWs. 

Abstract. Data integration systems represent today a key technological 
infrastructure for managing the enormous amount of information 
that is increasingly distributed over many data sources, often stored 
in different heterogeneous formats. Several different approaches 
providing transparent access to the data by means of suitable query answering 
strategies have been proposed in the literature. These approaches often 
assume that all the sources have the same level of reliability and that 
there is no need for preferring values “extracted” from a given source. 
This is mainly due to the difficulties of properly translating and 
reformulating source preferences in terms of properties expressed over the 
global view supplied by the data integration system. Nonetheless, 
preferences are very important auxiliary information that can be profitably 
exploited for refining the way in which integration is carried out. In this 
paper we tackle the above difficulties and we propose a formal framework 
for both specifying and reasoning with preferences among the sources. 
The semantics of the system is restated in terms of preferred answers 
to user queries, and the computational complexity of identifying these 
answers is investigated as well. 



1 Introduction 

The enormous amount of information that is increasingly distributed over many 
data sources, often stored in different heterogeneous formats, has boosted in 
recent years the interest in data integration systems. Roughly speaking, a data 
integration system offers transparent access to the data by providing users with 
the so-called global schema, which they can query in order to extract data 
relevant for their aims. Then, the system is in charge of accessing each source 
separately, and combining local results into the global answer. The means that 
the system exploits to answer users’ queries is the mapping specifying the 
relationship between the sources and the global schema [16]. 

However, data at the sources may turn out to be mutually inconsistent, because of 
the presence of integrity constraints specified on the global schema in order to 
enhance its expressiveness. To remedy this problem, several papers (see, e.g., [3, 
6, 4, 11]) proposed to handle the inconsistency by suitably “repairing” retrieved 
data. Basically, such papers extend to data integration systems previous studies 
focused on a single inconsistent database or on the merging of mutually 
inconsistent databases into a single consistent theory [2, 14, 17]. Intuitively, one aspect 
deserving particular care, which characterizes the inconsistency problem in data 
integration with respect to the latter works, is the presence of the mapping 
relating data stored at the sources with the elements of the (virtual) global schema, 
over which constraints of interest for the integration application are issued. 

Here, the suitability of a possible repair depends on the underlying semantic 
assumptions which are adopted for the mapping and on the type of constraints 
on the global schema. Roughly speaking, the assumptions for the mapping 
provide the means for interpreting data at the sources with respect to the intended 
extension of the global schema. In this respect, mappings are in general 
considered sound, i.e., the data that can be retrieved from the sources through the 
mapping are assumed to be a subset of the intended data of the corresponding 
global elements [16]. This is for example the mapping interpretation adopted in 
[3, 6, 4], where soundness is exploited for constructing those database extensions 
for the global schema that are enforced by the data stored at the sources and 
the mapping. Since the global databases obtained may be inconsistent with 
respect to global constraints, suitable repairs (basically deletions and additions of 
tuples) are performed to restore consistency. 

None of the above-mentioned works takes preference criteria into account
when trying to resolve inconsistencies among data sources. We could say that they
implicitly assume that all the sources have the same level of reliability, and that
there is no reason for preferring values coming from one source over
data retrieved from another source. On the other hand, in practical applications
it often happens that some sources are known to be more reliable than others,
which yields potentially useful criteria for establishing the
suitability of a repair. In other words, besides the semantic assumption on the
mapping, preference criteria expressed among sources should also be taken into
account when resolving inconsistency.

Despite the wide interest in this field, little effort has been devoted to enriching
the data integration setting with qualitative or quantitative descriptions of the
sources. The first (and almost isolated) attempt is in [18], where the authors
introduce two parameters for characterizing each source: the soundness, which
is used for assessing the confidence we can place in the answers provided by the
source, and the completeness, which is used for measuring how much relevant
information is stored in the source. However, the proposed framework does not
fit the requirements of typical data integration systems, since it does not admit
constraints over the global schema, and since it focuses only on the consistency
problem, i.e., determining whether a global database exists that is consistent
with all the claims of soundness and completeness of individual sources.

Other works (see, e.g., [14, 10, 15]) deal instead with special cases, where
preferences are defined among repairs of a single database, and, hence, they do
not capture the many facets of the data integration setting. In other words, such
approaches do not tackle inconsistency in the presence of a mapping between
the database schema, which has to be kept consistent, and information
sources that provide possibly inconsistent data. This is instead the challenging
setting when tackling inconsistency in data integration in the presence of source
preferences, which calls for suitable translations and reformulations, in which
preferences between sources are mapped into preferences between repairs.

In this paper, we address this problem by proposing a formal framework for
both specifying and reasoning on preferences among sources. Specifically, the 
main contributions of this paper are the following. 

> We introduce a new semantics which is based on the repair of data stored at
the sources in the presence of global inconsistency, rather than considering 
the repair of global database instances constructed according to the map- 
ping. This approach is essentially a form of abductive reasoning [19], since 
it directly resolves the conflicts by isolating their causes at the sources. This 
part is described in Section 3. 

> We show that our novel repair semantics allows us to properly take source
preferences into account. Following the extensive literature (see, e.g., [8, 7]
and the references therein) from the database community, prioritized logics, logic
programming, and decision theory, we exploit two different approaches for
specifying preferences among sources. Specifically, we consider unary and
binary constraints for defining quantitative properties of sources and
relationships between sources, respectively.

> We show how preferences expressed over the sources can be exploited for
refining the way of answering queries in data integration systems. To this
aim, we introduce the concept of strongly preferred answers, characterizing
the answers that can be obtained after the system is repaired according
to users’ preferences. We also investigate a weaker semantics that
looks for weakly preferred answers, i.e., answers that are as close as possible
to the strongly preferred ones. This part and the above one are described in
Section 4.

> Finally, the computational complexity of computing both strongly and
weakly preferred answers is studied, by considering the most common
integrity constraints that can be issued on relational databases. We show that
computing strongly preferred answers is co-NP-complete, and hence it is as
difficult as computing answers without additional constraints [5]. However,
when turning to the weak semantics, we observe a small increase in complexity
that does not lift the problem to higher levels of the polynomial hierarchy.
Indeed, the problem is complete for the class $P^{NP}[O(\log n)]$. Computational
complexity is treated in Section 5.

2 Relational Databases 

In this section we recall the basic notions of the relational model with integrity
constraints. For further background on relational database theory, we refer the
reader to [1].

We assume a (possibly infinite) fixed database domain $\Gamma$ whose elements can
be referenced by constants $c_1, \ldots, c_n$ under the unique name assumption, i.e.,
different constants denote different objects. A relational schema $\mathcal{RS}$ is a pair
$\langle \Psi, \Sigma \rangle$, where $\Psi$ is a set of relation symbols, each with an associated arity that
indicates the number of its attributes, and $\Sigma$ is a set of integrity constraints, i.e.,
assertions that have to be satisfied by each database instance.

We deal with quantified constraints [1], i.e., first-order formulas of the form:

$$\forall \vec{x}.\ \bigwedge_{i=1}^{l} A_i \supset \exists \vec{y}.\ \bigvee_{j=1}^{m} B_j \vee \bigvee_{k=1}^{n} \phi_k \qquad (1)$$

where $l + m > 0$, $n \geq 0$, $A_1, \ldots, A_l$ and $B_1, \ldots, B_m$ are positive literals, $\phi_1, \ldots, \phi_n$
are built-in literals, $\vec{x}$ is a list of distinct variables, and $\vec{y}$ is a list of variables occurring
in $B_1, \ldots, B_m$ only. Notice that classical constraints issued on a relational schema,
such as functional, exclusion, or inclusion dependencies, can be expressed in the form
(1). Furthermore, they are also typical of conceptual modelling languages.

A database instance (or simply database) $\mathcal{DB}$ for a schema $\mathcal{RS} = \langle \Psi, \Sigma \rangle$ is
a set of facts of the form $r(t)$ where $r$ is a relation of arity $n$ in $\Psi$ and $t$ is an
$n$-tuple of constants of $\Gamma$. We denote as $r^{\mathcal{DB}}$ the set $\{t \mid r(t) \in \mathcal{DB}\}$. A database
$\mathcal{DB}$ for a schema $\mathcal{RS}$ is said to be consistent with $\mathcal{RS}$ if it satisfies (in the
first-order logic sense) all constraints expressed on $\mathcal{RS}$.

A relational query (or simply query) over $\mathcal{RS}$ is a formula that is
intended to extract tuples of elements of $\Gamma$. We assume that queries over
$\mathcal{RS} = \langle \Psi, \Sigma \rangle$ are Unions of Conjunctive Queries (UCQs), i.e., formulas of the form
$q(\vec{x}) \leftarrow conj_1(\vec{x}, \vec{y}_1) \vee \cdots \vee conj_m(\vec{x}, \vec{y}_m)$, where, for each $i \in \{1, \ldots, m\}$,
$conj_i(\vec{x}, \vec{y}_i)$ is a conjunction of atoms whose predicate symbols are in $\Psi$ and
involve $\vec{x} = X_1, \ldots, X_n$ and $\vec{y}_i = Y_{i,1}, \ldots, Y_{i,n_i}$, where $n$ is the arity of the query,
and each $X_k$ and each $Y_{i,j}$ is either a variable or a constant of $\Gamma$. $q(\vec{x})$ is called
the head of $q$. Given a database $\mathcal{DB}$ for $\mathcal{RS}$, the answer to a UCQ $q$ over $\mathcal{DB}$,
denoted $q^{\mathcal{DB}}$, is the set of $n$-tuples of constants $\langle c_1, \ldots, c_n \rangle$ such that, when
substituting each $X_i$ with $c_i$, the formula
$\exists \vec{y}_1.conj_1(\vec{x}, \vec{y}_1) \vee \cdots \vee \exists \vec{y}_m.conj_m(\vec{x}, \vec{y}_m)$
evaluates to true in $\mathcal{DB}$.

3 Data Integration Systems 

Framework. A data integration system $\mathcal{I}$ is a triple $\langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, where $\mathcal{G}$ is the
global (relational) schema of the form $\mathcal{G} = \langle \Psi, \Sigma \rangle$, $\mathcal{S}$ is the source (relational)
schema of the form $\mathcal{S} = \langle \Psi_{\mathcal{S}}, \emptyset \rangle$, i.e., there are no integrity constraints on the
sources, and $\mathcal{M}$ is the mapping between $\mathcal{G}$ and $\mathcal{S}$. We assume that the mapping
is specified in the global-as-view (GAV) approach [16], where every relation of
the global schema is associated with a view, i.e., a query, over the source schema.
Therefore, $\mathcal{M}$ is a set of UCQs expressed over $\mathcal{S}$, where the predicate symbol in
the head is a relation symbol of $\mathcal{G}$.

Example 1 Consider the data integration system $\mathcal{I}_0 = \langle \mathcal{G}_0, \mathcal{S}_0, \mathcal{M}_0 \rangle$ where the
global schema $\mathcal{G}_0$ consists of the relation predicates employee(Name, Dept) and
boss(Employee, Manager). The associated set of constraints contains the two
following assertions (quantifiers are omitted)

$$employee(X, Y) \wedge boss(X_1, Y_1) \supset X \neq Y_1; \qquad boss(X, Y) \wedge boss(X_1, Y_1) \supset Y_1 \neq X,$$

stating that managers are never employees. The source schema $\mathcal{S}_0$ comprises the
relation symbols $s_1$, $s_2$, $s_3$, and the mapping $\mathcal{M}_0$ contains the following UCQs

$$employee(X, Y) \leftarrow s_1(X, Y); \qquad boss(X, Y) \leftarrow s_2(X, Y) \vee s_3(X, Y).$$ □

We call any database $\mathcal{D}$ for the source schema $\mathcal{S}$ a source database for $\mathcal{I}$.
Based on $\mathcal{D}$, we specify the semantics of $\mathcal{I}$, which is given in terms of database
instances for $\mathcal{G}$, called global databases for $\mathcal{I}$. In particular, we construct a global
database by evaluating each view in the mapping $\mathcal{M}$ over $\mathcal{D}$. Such a database is
called the retrieved global database, and is denoted by $ret(\mathcal{I}, \mathcal{D})$.

Example 1 (contd.) Let $\mathcal{D}_0 = \{s_1(Mary, D1), s_2(John, Mary), s_3(Albert, Bill)\}$
be a source database. Then, the evaluation of each view in the mapping over $\mathcal{D}_0$
is $ret(\mathcal{I}_0, \mathcal{D}_0) = \{employee(Mary, D1), boss(John, Mary), boss(Albert, Bill)\}$. □
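
The construction of the retrieved global database in Example 1 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: facts are encoded as (relation, tuple) pairs, and each view is represented as a plain list of source relations whose union defines the global relation, which suffices for the mapping of Example 1 but not for arbitrary UCQs. The names `ret`, `M0` and `D0` are chosen here to mirror the notation of the text.

```python
# Source database D0 of Example 1, encoded as (relation, tuple) facts.
D0 = {
    ("s1", ("Mary", "D1")),
    ("s2", ("John", "Mary")),
    ("s3", ("Albert", "Bill")),
}

# GAV mapping M0 (assumed encoding): each global relation is the union of
# the listed source relations, as in the two views of Example 1.
M0 = {
    "employee": ["s1"],
    "boss": ["s2", "s3"],
}


def ret(mapping, source_db):
    """Evaluate every view over the source database and collect the global facts."""
    return {
        (global_rel, tup)
        for global_rel, source_rels in mapping.items()
        for (rel, tup) in source_db
        if rel in source_rels
    }


retrieved = ret(M0, D0)
```

Under this encoding, `retrieved` contains exactly the three global facts listed in the example.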

In general, the retrieved global database is not the only database that we
consider to specify the semantics of $\mathcal{I}$ w.r.t. $\mathcal{D}$, but we account for all global
databases that contain $ret(\mathcal{I}, \mathcal{D})$. This means considering sound mappings: data
retrieved from the sources by the mapping views are assumed to be a subset 
of the data that satisfy the corresponding global relation. This is a classical as- 
sumption in data integration, where sources in general do not provide all the 
intended extensions of the global schema, hence extracted data are to be con- 
sidered sound but not necessarily complete. Next, we formalize the notion of 
mapping satisfaction. 

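Definition 1. Given a data integration system $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, and a source
database $\mathcal{D}$ for $\mathcal{I}$, a global database $\mathcal{B}$ for $\mathcal{I}$ satisfies the mapping $\mathcal{M}$ w.r.t. $\mathcal{D}$ if
$\mathcal{B} \supseteq ret(\mathcal{I}, \mathcal{D})$. □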

Notice that databases that satisfy the mapping might be inconsistent with
respect to dependencies in $\Sigma$, since data stored in local and autonomous sources
are not in general required to satisfy constraints expressed on the global schema.
Furthermore, cases might arise in which no global database exists that satisfies
both the mapping and the constraints over $\mathcal{G}$ (for example when a key dependency
on $\mathcal{G}$ is violated by data retrieved from the sources). On the other hand, constraints
issued over the global schema must be satisfied by those global databases that
we want to consider “legal” for the system [16].

Repairing global databases. In order to solve inconsistency, several
approaches have been recently proposed, in which the semantics of a data
integration system is given in terms of the repairs of the global databases that
the mapping forces to be in the semantics of the system [3, 5, 4]. Such papers
extend to data integration previous proposals given in the field of inconsistent
databases [12, 2, 14], by considering a sound interpretation of the mapping. In
this context, repairs are obtained by means of addition and deletion of tuples
over the inconsistent database. Modifications are performed according to
minimality criteria that are specific for each approach. Analogously, works on
inconsistency in data integration basically propose to properly repair the global
databases that satisfy the mapping in order to make them satisfy constraints on
the global schema. In this respect, we point out that [3, 4] consider local-as-view
(LAV) mappings, where, conversely to the GAV approach, each source relation
is associated with a query over the global schema. In such papers, the notion of
retrieved global database is replaced with the notion of minimal global databases
that can be constructed according to the mapping specification and data stored
at the sources. Then, a global database satisfies the mapping if it contains at least
one minimal global database. Repairs computed in [3, 5, 4] are in general global
databases that do not satisfy the mapping. Furthermore, they cannot always be
retrieved through the mapping from a source database instance. According to
[16], we could say that in these approaches, constraints are considered strong,
whereas the mapping is considered soft.

Example 2 Consider the simple situation in which the global schema of a data
integration system $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ contains two relation symbols $g_1$ and $g_2$, both
of arity 1, that are mutually disjoint, i.e., the constraint
$\forall X, Y.\ g_1(X) \wedge g_2(Y) \supset X \neq Y$ is issued over $\mathcal{G}$. Assume that the mapping $\mathcal{M}$ comprises the queries
$g_1(X) \leftarrow s(X)$ and $g_2(X) \leftarrow s(X)$, where $s$ is a unary source relation symbol.
Let $\mathcal{D} = \{s(a)\}$ be the source database for $\mathcal{I}$. Then, $ret(\mathcal{I}, \mathcal{D}) = \{g_1(a), g_2(a)\}$
is inconsistent w.r.t. the global constraint. In this case, the above-mentioned
approaches propose to repair $ret(\mathcal{I}, \mathcal{D})$ by eliminating from each database
satisfying the mapping either $g_1(a)$ or $g_2(a)$ (but not both), thus producing in the
two cases two different classes of global databases that are in the semantics of
the system. Notice, however, that each global database that contains only $g_1(a)$
or only $g_2(a)$ does not satisfy the soundness of the mapping, and cannot be
retrieved from any source database for $\mathcal{I}$. □

Even if effective for repairing global database instances in the presence of 
inconsistency, the above approaches do not seem appropriate when preferences 
specified over the sources should be taken into account for solving inconsistency. 
Indeed, in these cases, one would prefer, for example, to drop tuples coming from 
less reliable source relations rather than considering all possible repairs to be at 
the same level of reliability. Nonetheless, it is not always easy to understand how 
preferences over tuples stored at the sources could be mapped on preferences over 
tuples of the global schema. 

Example 3 Consider for example the simple data integration system in which
the mapping contains the query $g(X, Y) \leftarrow (s_1(X, Z) \wedge s_2(Z, Y)) \vee s_3(X, Y)$,
and a constraint stating that the first component of the global relation $g$ is the
key of $g$. Assume that we have the source database $\mathcal{D} = \{s_1(a, b), s_2(b, c), s_3(a, d)\}$,
and that we know that source relation $s_3$ is more reliable than source relation $s_2$.
Then, $ret(\mathcal{I}, \mathcal{D}) = \{g(a, c), g(a, d)\}$ violates the key constraint on $g$, and it seems
reasonable to prefer dropping the fact $g(a, c)$, in order to guarantee consistency,
rather than $g(a, d)$, according to source preferences. However, we do not have
a preference specified between these two global facts that would allow us to
adopt this choice. □
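
The situation of Example 3 can be reproduced with a small Python sketch. It is an illustration only: the encoding of facts as (relation, tuple) pairs and the helper names `tuples_of`, `g_view` and `violates_key` are assumptions made here, not part of the framework. The view joins $s_1$ and $s_2$ on the shared variable and unites the result with $s_3$, so the provenance of each global fact is lost once the view has been evaluated.

```python
# Source database of Example 3, encoded as (relation, tuple) facts.
D = {
    ("s1", ("a", "b")),
    ("s2", ("b", "c")),
    ("s3", ("a", "d")),
}


def tuples_of(db, relation):
    """Return the tuples stored in the given relation."""
    return {tup for (rel, tup) in db if rel == relation}


def g_view(db):
    """Evaluate g(X, Y) <- (s1(X, Z) AND s2(Z, Y)) OR s3(X, Y)."""
    joined = {
        (x, y)
        for (x, z1) in tuples_of(db, "s1")
        for (z2, y) in tuples_of(db, "s2")
        if z1 == z2
    }
    return joined | tuples_of(db, "s3")


def violates_key(tuples):
    """True if two distinct tuples agree on the first component (the key)."""
    keys = [tup[0] for tup in tuples]
    return len(keys) != len(set(keys))


g_facts = g_view(D)
key_violated = violates_key(g_facts)
```

The two retrieved facts share the key value `a`, and nothing in `g_facts` records that one of them depends on the less reliable relation $s_2$.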

The above example shows that we would need some mechanism to infer
preferences over tuples of the global schema starting from preferences at the 
sources. On the other hand, it is not always obvious or easy to define such a 
mechanism. A different solution could be to move the focus, when repairing, 
from tuples of the global schema to tuples of the sources, i.e., minimally modify 
the source database. In this way, we could compare two repairs (at the sources) 
on the basis of the preferences established over the source relations. 

Repairing the sources. The idea at the basis of our approach consists in 
finding the proper set of facts at the sources that imply as a consequence a 
global database that satisfies the integrity constraints. Basically, such a way of
proceeding is a form of abductive reasoning [19]. Notice also that, according to 
this approach we consider “strong” both the mapping and the constraints, i.e., 
we take into account only global databases that satisfy both the mapping and 
the constraints on the global schema. Furthermore, each global database that 
we consider, can be computed by means of the mapping from a suitable source 
database. Let us now precisely characterize the ideas informally described above. 

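Definition 2. Given a data integration system $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, where
$\mathcal{G} = \langle \Psi, \Sigma \rangle$, and a source database $\mathcal{D}$ for $\mathcal{I}$, $\mathcal{I}$ is satisfiable w.r.t. $\mathcal{D}$ if there exists a
global database $\mathcal{B}$ for $\mathcal{I}$ such that

– $\mathcal{B}$ satisfies $\Sigma$, and

– $\mathcal{B}$ satisfies the mapping $\mathcal{M}$ w.r.t. $\mathcal{D}$. □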

We next introduce a partial order between source databases for which the 
system is satisfiable.

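Definition 3. Let $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ be a data integration system, where
$\mathcal{G} = \langle \Psi, \Sigma \rangle$, and let $\mathcal{D}_1, \mathcal{D}_2 \subseteq \mathcal{D}$ be two source databases for $\mathcal{I}$ such that $\mathcal{I}$ is
satisfiable w.r.t. $\mathcal{D}_1$ and $\mathcal{D}_2$. Then, we say that $\mathcal{D}_1 \leq_{(\mathcal{I},\mathcal{D})} \mathcal{D}_2$ if $\mathcal{D}_1 \cap \mathcal{D} \supseteq \mathcal{D}_2 \cap \mathcal{D}$.
Furthermore, $\mathcal{D}_1 <_{(\mathcal{I},\mathcal{D})} \mathcal{D}_2$ if $\mathcal{D}_1 \leq_{(\mathcal{I},\mathcal{D})} \mathcal{D}_2$ holds and $\mathcal{D}_2 \leq_{(\mathcal{I},\mathcal{D})} \mathcal{D}_1$ does not hold. □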

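We say that a source database $\mathcal{D}'$ is minimal w.r.t. $\leq_{(\mathcal{I},\mathcal{D})}$ if there does not
exist $\mathcal{D}''$ such that $\mathcal{D}'' <_{(\mathcal{I},\mathcal{D})} \mathcal{D}'$. Furthermore, we indicate with $min_{\leq}(\mathcal{I}, \mathcal{D})$ the
set of source databases that are minimal w.r.t. $\leq_{(\mathcal{I},\mathcal{D})}$.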

Example 1 (contd.) The retrieved global database violates the constraints
on the global schema, as witnessed by the facts employee(Mary, D1) and
boss(John, Mary), for which Mary is both an employee and a manager. Therefore,
$\mathcal{I}_0$ is not satisfiable w.r.t. $\mathcal{D}_0$. Then, $min_{\leq}(\mathcal{I}_0, \mathcal{D}_0) = \{\mathcal{D}_1, \mathcal{D}_2\}$, where
$\mathcal{D}_1 = \{s_1(Mary, D1), s_3(Albert, Bill)\}$ and $\mathcal{D}_2 = \{s_2(John, Mary), s_3(Albert, Bill)\}$. □
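
For a source database as small as the one of Example 1, the minimal source databases can be enumerated by brute force. The sketch below is illustrative and rests on two assumptions stated here rather than in the text: the system is taken to be satisfiable w.r.t. a subset exactly when the retrieved global database of that subset is consistent (which holds for constraints, such as the two of Example 1, that can only be violated by the presence of tuples), and the views are plain unions of source relations. All function names are hypothetical.

```python
from itertools import combinations

D0 = frozenset({
    ("s1", ("Mary", "D1")),
    ("s2", ("John", "Mary")),
    ("s3", ("Albert", "Bill")),
})
M0 = {"employee": ["s1"], "boss": ["s2", "s3"]}


def ret(mapping, source_db):
    """Retrieved global database for union-only views."""
    return {
        (global_rel, tup)
        for global_rel, source_rels in mapping.items()
        for (rel, tup) in source_db
        if rel in source_rels
    }


def consistent(global_db):
    """Constraints of Example 1: no manager appears as an employee."""
    managers = {tup[1] for (rel, tup) in global_db if rel == "boss"}
    employees = {
        tup[0] for (rel, tup) in global_db if rel in ("employee", "boss")
    }
    return managers.isdisjoint(employees)


def minimal_source_dbs(mapping, source_db):
    """Subsets of source_db that are satisfiable and maximal under inclusion."""
    satisfiable = [
        frozenset(subset)
        for size in range(len(source_db) + 1)
        for subset in combinations(sorted(source_db), size)
        if consistent(ret(mapping, subset))
    ]
    return {
        db for db in satisfiable
        if not any(db < other for other in satisfiable)
    }


minimal = minimal_source_dbs(M0, D0)
```

The enumeration is exponential in the size of the source database and is meant only to make the ordering of Definition 3 concrete: a source database is minimal when it retains as many tuples of the original one as possible.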

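We are now able to define the semantics of a data integration system.

Definition 4. Given a data integration system $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, where
$\mathcal{G} = \langle \Psi, \Sigma \rangle$, and given a source database $\mathcal{D}$ for $\mathcal{I}$, a global database $\mathcal{B}$ is legal for
$\mathcal{I}$ w.r.t. $\mathcal{D}$ if

– $\mathcal{B}$ satisfies $\Sigma$,

– $\mathcal{B}$ satisfies the mapping $\mathcal{M}$ w.r.t. a minimal source database $\mathcal{D}'$ w.r.t. $\leq_{(\mathcal{I},\mathcal{D})}$,
i.e., there exists $\mathcal{D}' \in min_{\leq}(\mathcal{I}, \mathcal{D})$ such that $\mathcal{B} \supseteq ret(\mathcal{I}, \mathcal{D}')$.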

The set of all the legal databases is denoted by $Leg(\mathcal{I}, \mathcal{D})$. We point out that,
under the standard cautious semantics, answering a query posed over the global
schema $\mathcal{G}$ amounts to evaluating it on every legal database $\mathcal{B} \in Leg(\mathcal{I}, \mathcal{D})$.

Example 1 (contd.) The set $Leg(\mathcal{I}_0, \mathcal{D}_0)$ contains all global databases
that satisfy the global schema and that contain either
$ret(\mathcal{I}_0, \mathcal{D}_1) = \{employee(Mary, D1), boss(Albert, Bill)\}$ or
$ret(\mathcal{I}_0, \mathcal{D}_2) = \{boss(John, Mary), boss(Albert, Bill)\}$. Then, the answer to the user
query $q(X) \leftarrow boss(X, Y)$, which
asks for all employees that have a boss, is $\{\langle Albert \rangle\}$. □
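
The cautious answer of the example can be recomputed with the following Python sketch. It relies on an assumption that is not spelled out in the text: since the query is a UCQ (hence monotone) and the constraints of Example 1 can only be violated by adding tuples, each retrieved database $ret(\mathcal{I}_0, \mathcal{D}_i)$ is itself legal, so intersecting the answers over the two retrieved databases gives the answers that hold in every legal database. The helper names are hypothetical.

```python
M0 = {"employee": ["s1"], "boss": ["s2", "s3"]}

# The two minimal source databases of Example 1.
D1 = {("s1", ("Mary", "D1")), ("s3", ("Albert", "Bill"))}
D2 = {("s2", ("John", "Mary")), ("s3", ("Albert", "Bill"))}


def ret(mapping, source_db):
    """Retrieved global database for union-only views."""
    return {
        (global_rel, tup)
        for global_rel, source_rels in mapping.items()
        for (rel, tup) in source_db
        if rel in source_rels
    }


def q(global_db):
    """q(X) <- boss(X, Y): first components of the boss facts."""
    return {(tup[0],) for (rel, tup) in global_db if rel == "boss"}


def cautious_answers(query, mapping, minimal_dbs):
    """Tuples returned by the query on every retrieved database."""
    answer_sets = [query(ret(mapping, db)) for db in minimal_dbs]
    return set.intersection(*answer_sets)


answers = cautious_answers(q, M0, [D1, D2])
```

John is an answer only on the database retrieved from $\mathcal{D}_2$, so the intersection keeps Albert alone.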

Summarizing, our approach consists in properly repairing the source database
$\mathcal{D}$ in order to obtain another source database $\mathcal{D}'$ such that $\mathcal{I}$ is satisfiable w.r.t.
$\mathcal{D}'$. Obviously, if $\mathcal{I}$ is satisfiable w.r.t. $\mathcal{D}$, we do not need to repair $\mathcal{D}$.

Before concluding, we point out that the set $Leg(\mathcal{I}, \mathcal{D})$ is in general different
from the set of global databases that can be obtained by repairing the retrieved
global database instead of repairing the source database $\mathcal{D}$. This is evident
in Example 2, in which repairing is performed by dropping $s(a)$ at the sources;
therefore legal databases exist that contain neither $g_1(a)$ nor $g_2(a)$.

We conclude this section by considering the difficulty of checking whether a 
global database is indeed a repair. Such a difficulty will be evaluated by following 
the data complexity approach [20], i.e., by considering a given problem instance 
having as its input the source database — this is, in fact, the approach we shall 
follow in all the complexity results presented in the paper. 

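Theorem 4 (Repair Checking). Let $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ be a data integration
system with $\mathcal{G} = \langle \Psi, \Sigma \rangle$, $\mathcal{D}$ a source database for $\mathcal{I}$, and $\mathcal{B}$ a global database for
$\mathcal{I}$. Then, checking whether $\mathcal{B}$ is legal is feasible in polynomial time.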

4 Formalizing Additional Properties of the Sources 

In many real world applications, users often have some additional knowledge 
about data sources besides the mapping with the global schema, which can be 
modelled in terms of preference constraints specified over source relations. In 
this scenario, the aim is to exploit preference constraints for query answering in 
the presence of inconsistency. 

The framework we have introduced in Section 3 allows us to easily take into 
account information on such preferences when trying to solve inconsistency, since
repairing is performed by directly focusing on the sources, whose integration has
caused inconsistency.

Intuitively, when a data integration system $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ is equipped
with some additional preference constraints, we can easily exploit these
further requirements for identifying, given a source database $\mathcal{D}$, those elements of
$min_{\leq}(\mathcal{I}, \mathcal{D})$ which are preferred for answering a certain query. In this respect,
we distinguish between unary constraints, i.e., properties which characterize a
given data source, and binary constraints, i.e., properties which are expressed
over pairs of relations in the source schema $\mathcal{S}$.

4.1 Unary and Binary Constraints 

As already pointed out in [18], in order to provide accurate query answering, each
relation $r \in \mathcal{S}$ can be equipped with two parameters: the soundness measure and
the completeness measure. The former is used for assessing the confidence that 
we place in the answers provided by r, whereas the latter is used for evaluating 
how much relevant information is contained in r. In [18], the problem of querying 
partially sound and complete data sources has been studied in the context of 
data integration systems with LAV mapping and without integrity constraints 
on the global schema. In such a setting, it has been shown that deciding the 
existence of a global database satisfying some assumptions is NP-complete. 

Here, we extend such analysis to sound GAV mappings, in our repair
semantics for data integration systems. In this framework, we observe that the
completeness measure is of no practical interest, since each $\mathcal{B} \in Leg(\mathcal{I}, \mathcal{D})$ is
such that $\mathcal{B} \supseteq ret(\mathcal{I}, \mathcal{D}')$ for some $\mathcal{D}' \in min_{\leq}(\mathcal{I}, \mathcal{D})$. Therefore, constraints that
can be satisfied by adding tuples to $ret(\mathcal{I}, \mathcal{D}')$ can be seen as “automatically repaired”.
Indeed, in our repair semantics we do not consider addition of tuples at the sources
in order to repair constraints on the global schema. Therefore, we are actually
interested in bounding only the number of tuple deletions required at the sources in
order to repair the system. Then, for each source relation $r \in \mathcal{S}$, we denote by $r_s$ the
value of such bound, also called soundness constraint, whose semantics is as follows.

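Definition 5. Let $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ be a data integration system, $\mathcal{D}$ a source
database for $\mathcal{I}$, $r$ a relation symbol in $\mathcal{S}$, and $r_s$ a soundness constraint for $r$.
Then, a source database $\mathcal{D}' \in min_{\leq}(\mathcal{I}, \mathcal{D})$ satisfies $r_s$ if

$$|r^{\mathcal{D} - \mathcal{D}'}| \leq r_s,$$

i.e., if at most $r_s$ tuples of $r$ have been deleted from $\mathcal{D}$ in order to obtain $\mathcal{D}'$. □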

Even though in several situations the soundness measure is not directly
available for characterizing a source relation in an absolute way, the user might be
able to compare the soundness of two different sources. For instance, the user might
not know the soundness constraints for source relations $r_1$ and $r_2$, but might
have observed that “$r_1$ is more reliable than $r_2$”. Such intuition is formalized by
the notion of binary constraints.

Let $r_1$ and $r_2$ be two relation symbols of $\mathcal{S}$, and let $A$ denote a set of pairs
$\{P^1, \ldots, P^n\}$, such that $P^i = (A^i_{r_1}, A^i_{r_2})$, with $A^i_{r_1}$ and $A^i_{r_2}$ attributes of $r_1$ and
$r_2$, respectively. Any expression of the form $r_1 \prec_A r_2$ is a binary constraint over
$\mathcal{S}$, and its semantics is as follows.

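Definition 6. Let $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ be a data integration system and $\mathcal{D}$ a source
database for $\mathcal{I}$. Then, a source database $\mathcal{D}' \in min_{\leq}(\mathcal{I}, \mathcal{D})$ satisfies a binary
constraint of the form $r_1 \prec_A r_2$, with $P^i = (A^i_{r_1}, A^i_{r_2})$ for $i = 1..n$, if

$$\forall t_1 \in r_1^{\mathcal{D} - \mathcal{D}'},\ \exists t_2 \in r_2^{\mathcal{D} - \mathcal{D}'} \text{ with } \pi_{A^1_{r_1}, \ldots, A^n_{r_1}}(t_1) = \pi_{A^1_{r_2}, \ldots, A^n_{r_2}}(t_2),$$

where $\pi_{A^1, \ldots, A^n}(t)$ indicates the projection of the tuple $t$ on $A^1, \ldots, A^n$. □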

Roughly speaking, $\mathcal{D}'$ satisfies $r_1 \prec_A r_2$ if for each tuple $t_1$ that has been
deleted from $r_1^{\mathcal{D}}$ in order to obtain $\mathcal{D}'$, a tuple $t_2$ sharing the same values on
the attributes in $A$ has been deleted from $r_2^{\mathcal{D}}$. This behavior guarantees that $r_1$
is modified only if there is no way of repairing the data integration system by
modifying $r_2$ only.

Example 1 (contd.) Assume now that we specify the binary constraint $s_2 \prec_{(\$2,\$1)} s_1$
over the source schema $\mathcal{S}_0$, where \$2 indicates the second attribute of $s_2$ and
\$1 the first attribute of $s_1$. Then, $\mathcal{D}_1 = \{s_1(Mary, D1), s_3(Albert, Bill)\}$ violates
the constraint, since it is obtained by dropping from $\mathcal{D}_0$ the fact $s_2(John, Mary)$,
whose second component coincides with the first component of $s_1(Mary, D1)$,
which conversely has not been dropped. On the contrary, it is easy to see that
$\mathcal{D}_2 = \{s_2(John, Mary), s_3(Albert, Bill)\}$ satisfies the constraint. □
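
The check of Definition 6 on the two minimal source databases of Example 1 can be sketched as follows. The encoding is an assumption made for illustration: facts are (relation, tuple) pairs, and the attribute pairs of the constraint are given as 0-based positions, so that (\$2, \$1) becomes `(1, 0)`. The function names `project` and `satisfies_binary` are hypothetical.

```python
D0 = {
    ("s1", ("Mary", "D1")),
    ("s2", ("John", "Mary")),
    ("s3", ("Albert", "Bill")),
}
D1 = D0 - {("s2", ("John", "Mary"))}
D2 = D0 - {("s1", ("Mary", "D1"))}


def project(tup, positions):
    """Projection of a tuple on the given 0-based attribute positions."""
    return tuple(tup[p] for p in positions)


def satisfies_binary(source_db, repaired_db, r1, r2, pairs):
    """Check r1 < r2 on the attribute pairs: every tuple deleted from r1
    must be matched by a tuple deleted from r2 with equal projections."""
    deleted = source_db - repaired_db
    deleted_r1 = [tup for (rel, tup) in deleted if rel == r1]
    deleted_r2 = [tup for (rel, tup) in deleted if rel == r2]
    positions_r1 = [a for (a, _) in pairs]
    positions_r2 = [b for (_, b) in pairs]
    return all(
        any(
            project(t1, positions_r1) == project(t2, positions_r2)
            for t2 in deleted_r2
        )
        for t1 in deleted_r1
    )


# Constraint of the example: second attribute of s2, first attribute of s1.
d1_ok = satisfies_binary(D0, D1, "s2", "s1", [(1, 0)])
d2_ok = satisfies_binary(D0, D2, "s2", "s1", [(1, 0)])
```

For $\mathcal{D}_2$ no tuple of $s_2$ has been deleted, so the universally quantified condition holds vacuously, whereas for $\mathcal{D}_1$ the deleted fact $s_2(John, Mary)$ has no deleted counterpart in $s_1$.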



4.2 Soft Constraints 

As defined in the section above, unary and binary constraints often impose very 
severe restrictions on the possible ways repairs can be carried out. For instance, 
it might even happen that no minimal source database exists that satisfies such 
constraints, thereby leading to a data integration system with an empty
semantics. In order to face such situations, whenever necessary we can also turn to
a “weak” semantics that looks for repairs as close as possible to the preferred ones.
In this respect, preference constraints are interpreted in a soft version, and we 
aim at minimizing the number of violations, rather than imposing the absence 
of such violations. 



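Definition 7 (Satisfaction Factor). Let $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ be a data integration
system, $\mathcal{D}$ a source database for $\mathcal{I}$, and $\mathcal{D}'$ a source database in $min_{\leq}(\mathcal{I}, \mathcal{D})$.
Then, the satisfaction factor $w_{\mathcal{D}}(p, \mathcal{D}')$ for a constraint $p$ is

– $|r^{\mathcal{D} - \mathcal{D}'}|$, if $p$ is of the form $r_s$ and $\mathcal{D}'$ does not satisfy $r_s$, or

– the number of tuples $t_1 \in r_1^{\mathcal{D} - \mathcal{D}'}$ such that $\nexists t_2 \in r_2^{\mathcal{D} - \mathcal{D}'}$ with
$\pi_{A^1_{r_1}, \ldots, A^n_{r_1}}(t_1) = \pi_{A^1_{r_2}, \ldots, A^n_{r_2}}(t_2)$, if $p$ has the form $r_1 \prec_{\{P^1, \ldots, P^n\}} r_2$ with
$P^i = (A^i_{r_1}, A^i_{r_2})$ for $i = 1..n$. □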



Finally, the satisfaction factor of a set of constraints $\mathcal{P}$ is the value
$w_{\mathcal{D}}(\mathcal{P}, \mathcal{D}') = \sum_{p \in \mathcal{P}} w_{\mathcal{D}}(p, \mathcal{D}')$.

4.3 Preferred Answers

After unary and binary constraints have been formalized, we next show how they 
can be practically used for pruning the set of legal databases of a data integration 
system. Specifically, we first focus on the definition of preferred legal databases, 
and we next show how this notion can be exploited for defining preferred answers. 

In the following, given a data integration system $\mathcal{I}$, we denote by $constr(\mathcal{I})$
the set of preference constraints defined over $\mathcal{S}$. Then, the pair
$\mathcal{I}_c = \langle \mathcal{I}, constr(\mathcal{I}) \rangle$ is also said to be a constrained data integration system. The
semantics of the system is provided in terms of those legal databases that are
“retrieved” from source databases of $min_{\leq}(\mathcal{I}, \mathcal{D})$ that satisfy $constr(\mathcal{I})$.

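Definition 8. Let $\mathcal{I}_c = \langle \mathcal{I}, constr(\mathcal{I}) \rangle$ be a constrained data integration system
with $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ and $\mathcal{G} = \langle \Psi, \Sigma \rangle$, and let $\mathcal{D}$ be a source database for $\mathcal{I}$. Then,
a global database $\mathcal{B}$ is a (weakly) preferred legal database for $\mathcal{I}_c$ w.r.t. $\mathcal{D}$ if

– $\mathcal{B}$ satisfies $\Sigma$,

– $\mathcal{B} \supseteq ret(\mathcal{I}, \mathcal{D}')$, where $\mathcal{D}'$ is a minimal source database w.r.t. $\leq_{(\mathcal{I},\mathcal{D})}$ such
that no minimal source database $\mathcal{D}''$ exists with
$w_{\mathcal{D}}(constr(\mathcal{I}), \mathcal{D}'') < w_{\mathcal{D}}(constr(\mathcal{I}), \mathcal{D}')$.

If $w_{\mathcal{D}}(constr(\mathcal{I}), \mathcal{D}') = 0$, then $\mathcal{B}$ is a strongly preferred legal database for $\mathcal{I}_c$ w.r.t. $\mathcal{D}$. □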

We next provide the notion of answers to a query posed to a constrained 
data integration system. 

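Definition 9. Given a constrained data integration system $\mathcal{I}_c = \langle \mathcal{I}, constr(\mathcal{I}) \rangle$
with $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, a source database $\mathcal{D}$ for $\mathcal{I}$, and a query $q$ of arity $n$ over $\mathcal{G}$,
the set of the weakly preferred answers to $q$, denoted $ans^{\sim}(q, \mathcal{I}_c, \mathcal{D})$, is

$$\{\langle c_1, \ldots, c_n \rangle \mid \langle c_1, \ldots, c_n \rangle \in q^{\mathcal{B}} \text{ for each weakly preferred legal database } \mathcal{B}\}$$

The set of the strongly preferred answers to $q$, denoted $ans^{*}(q, \mathcal{I}_c, \mathcal{D})$, is

$$\{\langle c_1, \ldots, c_n \rangle \mid \langle c_1, \ldots, c_n \rangle \in q^{\mathcal{B}} \text{ for each strongly preferred legal database } \mathcal{B}\}$$ □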

Example 1 (contd.) Consider again the constraint $s_2 \prec_{(\$2,\$1)} s_1$. We have
already observed that only $\mathcal{D}_2$ satisfies such requirement. Then, the set of
strongly preferred databases is $\{\mathcal{B} \mid \mathcal{B} \supseteq ret(\mathcal{I}_0, \mathcal{D}_2)\}$. Therefore, for the query
$q(X) \leftarrow boss(X, Y)$, $ans^{*}(q, \mathcal{I}_0, \mathcal{D}_0) = \{\langle Albert \rangle, \langle John \rangle\}$. □
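
The selection of preferred repairs in this example can be sketched in Python by combining the satisfaction factor of Definition 7 (binary case only) with the intersection of query answers. The sketch is illustrative and makes several assumptions of its own: attribute pairs are 0-based positions, the two minimal source databases are supplied directly, the answers over each retrieved database are taken as representative of all legal databases containing it, and `None` is returned when no strongly preferred database exists. All names are hypothetical.

```python
M0 = {"employee": ["s1"], "boss": ["s2", "s3"]}
D0 = {
    ("s1", ("Mary", "D1")),
    ("s2", ("John", "Mary")),
    ("s3", ("Albert", "Bill")),
}
D1 = D0 - {("s2", ("John", "Mary"))}
D2 = D0 - {("s1", ("Mary", "D1"))}


def ret(mapping, source_db):
    """Retrieved global database for union-only views."""
    return {
        (global_rel, tup)
        for global_rel, source_rels in mapping.items()
        for (rel, tup) in source_db
        if rel in source_rels
    }


def binary_factor(source_db, repaired_db, r1, r2, pairs):
    """Number of tuples deleted from r1 with no matching tuple deleted from r2."""
    deleted = source_db - repaired_db
    deleted_r1 = [tup for (rel, tup) in deleted if rel == r1]
    deleted_r2 = [tup for (rel, tup) in deleted if rel == r2]
    return sum(
        1
        for t1 in deleted_r1
        if not any(
            all(t1[a] == t2[b] for (a, b) in pairs) for t2 in deleted_r2
        )
    )


def q(global_db):
    """q(X) <- boss(X, Y)."""
    return {(tup[0],) for (rel, tup) in global_db if rel == "boss"}


def preferred_answers(minimal_dbs, factor, strong):
    """Answers over the minimal source databases with the best factor."""
    scored = [(factor(db), db) for db in minimal_dbs]
    best = min(score for score, _ in scored)
    if strong and best != 0:
        return None  # no strongly preferred legal database exists
    chosen = [db for score, db in scored if score == best]
    return set.intersection(*(q(ret(M0, db)) for db in chosen))


def factor(db):
    """Satisfaction factor for the single constraint of the example."""
    return binary_factor(D0, db, "s2", "s1", [(1, 0)])


strong_answers = preferred_answers([D1, D2], factor, strong=True)
weak_answers = preferred_answers([D1, D2], factor, strong=False)
```

Here the factor is 1 for $\mathcal{D}_1$ and 0 for $\mathcal{D}_2$, so both semantics select $\mathcal{D}_2$ and return the same answers.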

We conclude this section by observing that the constraints we have defined 
can be evaluated in polynomial time on a given global database. However, they 
suffice for blowing up the intrinsic difficulty of identifying (preferred) global 
databases. 

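Theorem 5 (Preferred Repair Checking). Let $\mathcal{I}_c = \langle \mathcal{I}, constr(\mathcal{I}) \rangle$ and let
$\mathcal{D}$ be a source database for $\mathcal{I}$. Then, checking whether a global database $\mathcal{B}$ is
(strongly) preferred for $\mathcal{I}_c$ w.r.t. $\mathcal{D}$ is NP-hard.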

Proof (Sketch). NP-hardness can be proven by a reduction from the three-colorability
problem to our problem. Indeed, given a graph $G$ we can build a data integration
system $\mathcal{I}_c(G)$, a source database $\mathcal{D}$ and a legal database $\mathcal{B}$ for $\mathcal{I}$ such that $\mathcal{B}$ is
preferred $\Leftrightarrow$ $G$ is 3-colorable. □

5 Complexity Results

We next study the computational complexity of query answering in a constrained 
data integration system $\mathcal{I}_c$ under the novel semantics proposed in the paper.
Our aim is to point out the intrinsic difficulty of dealing with constraints at the
sources. Specifically, given a source database $\mathcal{D}$ for $\mathcal{I}_c$, we shall face the following
problems:

– StronglyAnswering: given a UCQ $q$ of arity $n$ over $\mathcal{G}$ and an $n$-tuple $t$ of
constants of $\Gamma_{\mathcal{D}}$, is $t \in ans^{*}(q, \mathcal{I}_c, \mathcal{D})$?

– WeaklyAnswering: given a UCQ $q$ of arity $n$ over $\mathcal{G}$ and an $n$-tuple $t$ of
constants of $\Gamma_{\mathcal{D}}$, is $t \in ans^{\sim}(q, \mathcal{I}_c, \mathcal{D})$?

where $\Gamma_{\mathcal{D}}$ denotes the constants of the domain $\Gamma$ that also occur in tuples of $\mathcal{D}$.

We shall consider the (common) case in which $\mathcal{G} = \langle \Psi, \Sigma \rangle$ is such that $\Sigma$
contains only key dependencies (KDs), functional dependencies (FDs) and exclusion
dependencies (EDs). We recall that these are classical constraints issued on a
relational schema, and that they can be expressed in the form (1) introduced
in Section 2. We also point out that violations of constraints of such form, e.g.,
two global tuples violating a key constraint, always lead to inconsistency in our
framework, since they can be repaired only by means of tuple deletions from the
source database. We are now ready to provide the first result of this section.

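Theorem 6. Let $\mathcal{I}_c = \langle \mathcal{I}, constr(\mathcal{I}) \rangle$ be a constrained data integration system
with $\mathcal{I} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, where $\mathcal{G} = \langle \Psi, \Sigma \rangle$ in which $\Sigma$ contains only FDs and EDs,
and let $\mathcal{D}$ be a source database for $\mathcal{I}_c$. Then, the StronglyAnswering problem
is co-NP-complete. Hardness holds even if $\Sigma$ contains either only KDs or only
EDs, and if $constr(\mathcal{I})$ is empty.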

Proof (Sketch). As for the membership, we consider the dual problem of deciding
whether $t \notin ans^{*}(q, \mathcal{I}_c, \mathcal{D})$, and we show that it is feasible in NP. In fact, we can
guess a source database $\mathcal{D}'$ obtained by removing tuples of $\mathcal{D}$ only. Then, we
can show how to verify that $\mathcal{D}'$ is minimal w.r.t. $\leq_{(\mathcal{I},\mathcal{D})}$ in polynomial time.

Hardness for the general case can be derived from the results reported in 
[5] and in [9] (where the problem of query answering under different semantics 
in the presence of KDs is studied). Hardness for EDs only can be proven in 
an analogous way by a reduction from the three colorability problem to the 
complement of our problem. □ 
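
One way in which the minimality of a guessed source database could be verified in polynomial time is sketched below. This is not the procedure of the proof, which is not given in the text, but a plausible reading under explicit assumptions: the constraints can only be violated by the presence of tuples (as for KDs, FDs and EDs), the views are monotone, and satisfiability w.r.t. a subset coincides with consistency of its retrieved database. Under these assumptions a consistent subset is maximal exactly when putting back any single deleted tuple breaks consistency, which requires only linearly many consistency checks. The code reuses the encoding of Example 1 and hypothetical helper names.

```python
M0 = {"employee": ["s1"], "boss": ["s2", "s3"]}
D0 = frozenset({
    ("s1", ("Mary", "D1")),
    ("s2", ("John", "Mary")),
    ("s3", ("Albert", "Bill")),
})


def ret(mapping, source_db):
    """Retrieved global database for union-only views."""
    return {
        (global_rel, tup)
        for global_rel, source_rels in mapping.items()
        for (rel, tup) in source_db
        if rel in source_rels
    }


def consistent(global_db):
    """Constraints of Example 1: no manager appears as an employee."""
    managers = {tup[1] for (rel, tup) in global_db if rel == "boss"}
    employees = {
        tup[0] for (rel, tup) in global_db if rel in ("employee", "boss")
    }
    return managers.isdisjoint(employees)


def is_minimal(mapping, source_db, candidate):
    """Check that candidate is a consistent subset of source_db to which
    no deleted tuple can be added back without violating the constraints."""
    if not candidate <= source_db:
        return False
    if not consistent(ret(mapping, candidate)):
        return False
    return all(
        not consistent(ret(mapping, candidate | {fact}))
        for fact in source_db - candidate
    )


D1 = D0 - {("s2", ("John", "Mary"))}
D2 = D0 - {("s1", ("Mary", "D1"))}
only_s3 = frozenset({("s3", ("Albert", "Bill"))})
```

With this check, `D1` and `D2` are accepted, while `only_s3` is rejected because either of the two deleted tuples could be restored without violating the constraints.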

The above result suggests that adding constraints to the sources enriches the
representation features of a data integration system, and that it is well-behaved from
a computational viewpoint. In fact, selecting preferred answers is as difficult as
selecting answers without additional preference constraints, whose complexity 
has been widely studied in [5]. 

We next turn to the WeaklyAnswering problem, in which a weaker semantics
is considered. Intuitively, this scenario provides an additional source of
complexity, since finding weakly preferred global databases amounts to solving an
implicit (NP) optimization problem. Interestingly, the increase in complexity is rather
small and does not lift the problem to higher levels of the polynomial hierarchy.
Actually, the problem stays within the polynomial time closure of NP, i.e., $P^{NP}$.
More precisely, it is complete for the class $P^{NP}[O(\log n)]$, in which the NP oracle
access is limited to $O(\log n)$ queries, where $n$ is the size of the source database
in input.

Theorem 7. Let I_c = (I, constr(I)) be a constrained data integration system
with I = (G, S, M) where G = (Ψ, Σ) in which Σ contains only FDs and EDs,
and let D be a source database for I_c. Then, the WeaklyAnswering problem is
P^NP[O(log n)]-complete.

Proof (Sketch). For the membership, we can preliminarily compute the maximum
value, say max, that the satisfaction factor for any source database may assume.
Then, by a binary search in [0, max], we can compute the best satisfaction factor,
say c: at each step of this search, we are given a threshold s and we call an NP
oracle to know whether there exists a source database D' whose satisfaction factor
is smaller than s. Finally, we ask another NP oracle whether there exists a source
database D'' with satisfaction factor c such that t does not belong to q^B, for the
minimal B ⊇ ret(I, D'') satisfying the constraints in G.
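
The logarithmic bound on oracle calls comes from the binary search alone. The sketch below illustrates only that part, under assumptions that are not in the proof: the satisfaction factor is integer-valued, smaller values are better, and `oracle(s)` is a hypothetical stand-in for one NP-oracle call answering "is there a source database with satisfaction factor at most s?" (a non-strict variant of the test used above).

```python
def best_factor(max_value, oracle):
    """Return (c, calls): the smallest c in [0, max_value] with oracle(c) true.

    `oracle` must be monotone (once true, true for all larger thresholds)
    and is assumed to be true at max_value.
    """
    lo, hi, calls = 0, max_value, 0
    while lo < hi:
        mid = (lo + hi) // 2
        calls += 1          # one (simulated) NP-oracle query per step
        if oracle(mid):
            hi = mid        # a database with factor <= mid exists
        else:
            lo = mid + 1    # every database has factor > mid
    return lo, calls


# Simulated oracle: the (unknown) best satisfaction factor is 37.
c, calls = best_factor(1000, lambda s: s >= 37)
```

With max bounded by a polynomial in n, the loop issues O(log n) queries; one further query then decides the answer for the optimal factor c.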

Hardness can be proved by a reduction from the following problem: Given a
formula Φ in conjunctive normal form on the variables Y = {Y_1, ..., Y_n}, a subset
X ⊆ Y, and a variable Y_i, decide whether Y_i is true in all the X-MAXIMUM
models, where a model M (satisfying assignment) is X-MAXIMUM if it has the
largest X-part, i.e., if the number of variables in the set X that are true w.r.t.
M is the maximum over all the satisfying assignments. □
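
For small instances, the source problem of the reduction can be checked by exhaustive enumeration. The sketch below is illustrative only; the clause encoding (lists of `(variable, polarity)` pairs) and the convention that an unsatisfiable formula yields True are assumptions made here.

```python
from itertools import product


def true_in_all_x_maximum_models(clauses, variables, x_part, target):
    """Decide whether `target` is true in every X-MAXIMUM model (brute force)."""
    models = []
    for values in product([False, True], repeat=len(variables)):
        m = dict(zip(variables, values))
        # A model must satisfy at least one literal of every clause.
        if all(any(m[v] == pol for v, pol in clause) for clause in clauses):
            models.append(m)
    if not models:
        return True  # assumption: vacuously true when there is no model
    best = max(sum(m[v] for v in x_part) for m in models)
    return all(m[target] for m in models if sum(m[v] for v in x_part) == best)


# (not Y1 or not Y2) and (not Y1 or Y3) and (not Y2 or Y3), with X = {Y1, Y2}.
cnf = [
    [("Y1", False), ("Y2", False)],
    [("Y1", False), ("Y3", True)],
    [("Y2", False), ("Y3", True)],
]
variables = ["Y1", "Y2", "Y3"]
```

Here every X-MAXIMUM model sets exactly one of Y1, Y2 to true, which forces Y3 to be true, while Y1 is false in one such model.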

6 Conclusions 

In this paper we have introduced and formalized the problem of enriching data 
integration systems with preferences among sources. Our approach is based on 
a novel semantics which relies on repairing the data stored at the sources in 
the presence of global inconsistency. Repairs performed at the sources allow us
to properly take into account preferences expressed over the sources when trying
to resolve inconsistency. Exploiting the presence of preference constraints, we
have introduced the notion of (strongly and weakly) preferred answers. Finally,
we have studied the computational complexity of computing both strongly and
weakly preferred answers for classes of key, functional and exclusion dependencies,
which are relevant classes of constraints for relational databases as well as
conceptual modelling languages.

The complexity results given in this paper can be easily extended to the case in
which inclusion dependencies are also present on the global schema, whenever
the problem of query answering is decidable; these cases have been studied in [5].
To the best of our knowledge, the present work is the first to provide formalizations
and complexity results for the problem of dealing with inconsistencies
by taking into account preferences specified among data sources in a pure GAV
data integration framework. Only recently, the same problem has been studied
for LAV integration systems in [13].

Abstract. In schema integration, schematic discrepancies occur when data in
one database correspond to metadata in another. We define this kind of semantic
heterogeneity in general using the paradigm of context, that is, the meta-information
relating to the source, classification, property, etc. of entities, relationships
or attribute values in entity-relationship (ER) schemas. We present
algorithms to resolve schematic discrepancies by transforming metadata into
entities, keeping the information and constraints of the original schemas. Although
focusing on the resolution of schematic discrepancies, our technique works
seamlessly with existing techniques resolving other semantic heterogeneities in
schema integration.



1 Introduction 

Schema integration involves merging several schemas into an integrated schema.
More precisely, [4] defines schema integration as “the activity of integrating the
schemas of existing or proposed databases into a global, unified schema”. It is an
important step in building a heterogeneous database system [6, 22] (also
called a multidatabase system or federated database system), in integrating data in a
data warehouse, and in integrating user views in database design. In schema
integration, different kinds of semantic heterogeneity among component
schemas have been identified: naming conflict (homonyms and synonyms), key
conflict, structural conflict [3, 15], and constraint conflict [14, 21].

A less studied problem is schematic discrepancy, i.e., the same information is
modeled as data in one database, but as metadata in another. This conflict arises
frequently in practice [11, 19]. We adopt a semantic approach to solve this issue. One of
the outstanding features of our proposal is that we preserve the cardinality constraints
in the transformation/integration of ER schemas. Cardinality constraints, in particular
functional dependencies (FDs) and multivalued dependencies (MVDs), are useful in
verifying lossless schema transformation [10], schema normalization and semantic
query optimization [9, 21] in multidatabase systems. The following example illustrates
schematic discrepancy in ER schemas. To focus our contribution and simplify
the presentation, in the example below, schematic discrepancy is the only kind of
conflict among schemas.

Example 1. Suppose we want to integrate supply information of products from several
databases (Fig. 1). These databases record the same information, i.e., product
numbers, product names, suppliers and supplying prices in each month, but have
discrepant schemas. In DB1, suppliers and months are modeled as entity types. In
DB2, months are modeled as metadata of entity types, i.e., each entity type models
the products supplied in one month, and suppliers are modeled as metadata of attributes,
e.g., the attribute S1_PRICE records the supplying prices by supplier s1 [footnote 1]. In
DB3, months are modeled as metadata of relationship types, i.e., each relationship
type models the supply relation in one month. We propose (in Section 4) to resolve
the discrepancies by transforming the metadata into entities, i.e., transforming DB2
and DB3 into the form of DB1. The statements on the right side of Fig. 1 provide the
semantics of the constructs of these schemas using ontology, which will be explained
in Section 3. □



[Fig. 1: ER diagrams of DB1, DB2 and DB3 not reproduced; the accompanying statements are:]

DB1:
PROD = product
    {P# = p#, PNAME = pname}
SUPPLIER = supplier
    {S# = s#}
MONTH = month
    {MONTH = month}
SUP = supply
    {PRICE = price}

DB2:
JAN_PROD = product[month='jan']
    {P# = p#, PNAME = pname,
     S1_PRICE = price[supplier='s1', inherit ALL], ...}
...

DB3:
JAN_SUP = supply[month='jan']
    {PRICE = price[inherit ALL]}

Fig. 1. Schematic discrepancy: months and suppliers modeled differently in DB1, DB2 and DB3



Paper organization. The rest of the paper is organized as follows. Section 2 is an
introduction to the ER approach. Sections 3 and 4 contain the main contributions of this
paper. In Section 3, we first introduce the concepts of ontology and context, and the
mappings from schema constructs of ER schemas onto types of ontology. Then we
define schematic discrepancy in general using the paradigm of context. In Section 4,
we present algorithms to resolve schematic discrepancies in schema integration,
without any loss of information and cardinality constraints. In Section 5, we compare our
work with related work. Section 6 concludes the paper.



1 Without causing confusion, we blur the difference between entities and identifiers of
entities. E.g., we use supplier number s1 to refer to a supplier with identifier S# = s1,
i.e., s1 plays both the roles of an attribute value of S# and an entity of supplier.

2 ER Approach

In the ER model, an entity is an object in the real world that can be distinctly identified.
An entity type is a collection of similar entities that have the same set of predefined
common attributes. Attributes can be single-valued, i.e., 1:1 (one-to-one) or m:1
(many-to-one), or multivalued, i.e., 1:m (one-to-many) or m:m (many-to-many). A
minimal set of attributes of an entity type E which uniquely identifies E is called a key
of E. An entity type may have more than one key and we designate one of them as the
identifier of the entity type. A relationship is an association among two or more entities.
A relationship type is a collection of similar relationships that satisfy a set of
predefined common attributes. A minimal set of attributes (including the identifiers of
participating entity types) in a relationship type R that uniquely identifies R is called a
key of R. A relationship type may have more than one key and we designate one of
them as the identifier of the relationship type.

The cardinality constraints of ER schemas incorporate FDs and MVDs. For example,
given the ER schema below, let K1, K2 and K3 be the identifiers of E1, E2 and
E3; we have:

K1 → A1 and A1 → K1, as A1 is a 1:1 attribute of E1;

K2 → A2, as A2 is a m:1 attribute of E2;

K3 ↠ A3, as A3 is a m:m attribute of E3;

{K1, K2} → K3, as the cardinality of E3 is 1 in R;

{K1, K2} → B, as B is a m:1 attribute of R.

[Figure: ER schema with entity types E1, E2 and E3 (attributes A1, A2 and A3) and a
relationship type R with attribute B; diagram not reproduced.]
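
As an illustration of what these dependencies assert on instances, here is a minimal sketch that tests an FD and an MVD over rows represented as dictionaries. The sample rows and the extra attribute C are hypothetical and are not taken from the schema above.

```python
def holds_fd(rows, lhs, rhs):
    """X -> Y holds if rows that agree on X also agree on Y."""
    seen = {}
    for r in rows:
        x = tuple(r[a] for a in lhs)
        y = tuple(r[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False
    return True


def holds_mvd(rows, lhs, rhs):
    """X ->> Y holds if, per X value, Y values and the rest combine freely."""
    if not rows:
        return True
    rest = [a for a in rows[0] if a not in lhs and a not in rhs]
    groups = {}
    for r in rows:
        x = tuple(r[a] for a in lhs)
        ys, zs, pairs = groups.setdefault(x, (set(), set(), set()))
        y = tuple(r[a] for a in rhs)
        z = tuple(r[a] for a in rest)
        ys.add(y)
        zs.add(z)
        pairs.add((y, z))
    # The observed (Y, rest) pairs must form the full cross product.
    return all(len(p) == len(ys) * len(zs) for ys, zs, p in groups.values())


r_rows = [
    {"K1": 1, "K2": 1, "K3": "x", "B": 5},
    {"K1": 1, "K2": 2, "K3": "x", "B": 6},
]
e3_rows = [
    {"K3": 1, "A3": "a", "C": "c1"},
    {"K3": 1, "A3": "b", "C": "c1"},
    {"K3": 1, "A3": "a", "C": "c2"},
    {"K3": 1, "A3": "b", "C": "c2"},
]
```

On `r_rows`, {K1, K2} → K3 holds while K1 → B does not; on `e3_rows`, K3 ↠ A3 holds and stops holding as soon as one of the four rows is removed.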



3 Ontology and Context 

In this section, we first represent the constructs of ER schemas using ontology, then
define schematic discrepancy in general based on the schemas represented using
ontology. In this paper, we treat ontology as the specification of a representational
vocabulary for a shared domain of discourse which includes the definitions of types
(representing classes, relations, and properties) and their values. We present ontology
at a conceptual level, which could be implemented by an ontology language, e.g.,
OWL [20].

For example, suppose ontology SupOnto describes the concepts in the universe of
product supply. It includes the following types: product, month, supplier, supply (i.e.,
the supply relations among products, months and suppliers), price (i.e., the supplying
prices of products), p#, pname, s#, etc. It also includes the values of these types, e.g.,
jan, ..., dec for month, and s1, ..., sn for supplier. Note that we use lower case italic words
to represent types and values of ontology, in contrast to capitals for schema constructs
of an ER schema. In OWL terms, product, month, supplier and supply
would be declared as classes, p# and pname as properties of product, s# as a property
of supplier, and price as a property of supply.

Conceptual modeling is always done within a particular context. In particular, the
context of an entity type, relationship type or attribute is the meta-information relating
to its source, classification, property, etc. Contexts are usually at four levels: database,
object class, relationship type and attribute. An entity type may "inherit" a context
from a database (i.e., the context of a database applies to the entities), and so on. In
general, the inheritance hierarchy of contexts at different levels is:

Database → Entity type → Relationship type → Attribute of relationship type
Entity type → Attribute of entity type

We give a formal representation of context below. Note that, as the context of a
database would be handled in the object classes which inherit it, we do not consider
database level contexts in the rest of the paper.

Definition 1. Given an ontology, we represent an entity type (relationship type, or
attribute) E as:

E = T[C_1=c_1, ..., C_m=c_m, inherit C_{m+1}, ..., C_n]

where T, C_1, ..., C_n are types in the ontology, and each c_i is a value of C_i for i ∈
{1, ..., m}. C_{m+1}, ..., C_n respectively have values c_{m+1}, ..., c_n which are not
explicitly given. This representation means that each instance of E is a value of T, and
satisfies the conditions C_i = c_i for each i ∈ {1, ..., n}. C_1, ..., C_n with their values
constitute the context within which E is defined; we call them meta-attributes, and their
values metadata of E. Furthermore, C_{m+1}, ..., C_n with their values are from the context
at a higher level (i.e., the context of a database if E is an entity type, the contexts of
entity types if E is a relationship type, or the context of an entity type/relationship
type if E is an attribute). We say that E inherits the meta-attributes C_{m+1}, ..., C_n with
their values. If E inherits all the meta-attributes with values of the higher level context, we
simply represent it as:

E = T[C_1=c_1, ..., C_m=c_m, inherit ALL].

For easy reference, we call the set {C_1=c_1, ..., C_m=c_m} the self context, and
{C_{m+1}=c_{m+1}, ..., C_n=c_n} the inherited context of E. □

In the above representation of E, either self or inherited context could be empty.
Specifically, when the context of E is empty, we have E = T.
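
One possible in-memory rendering of this representation is sketched below. It is only an illustration: the class name `Construct`, its fields, and the convention of marking "inherit ALL" with the string "ALL" are assumptions made here, not part of the definition.

```python
from dataclasses import dataclass, field


@dataclass
class Construct:
    """An entity type, relationship type or attribute E = T[self context, inherit ...]."""
    name: str
    ont_type: str                                      # T, a type in the ontology
    self_context: dict = field(default_factory=dict)   # {C_1: c_1, ..., C_m: c_m}
    inherits: object = ()                              # meta-attribute names, or "ALL"
    parent: "Construct | None" = None                  # construct at the higher level

    def context(self):
        """Inherited meta-attributes with values, plus the self context."""
        inherited = {}
        if self.parent is not None and self.inherits:
            parent_ctx = self.parent.context()
            names = parent_ctx if self.inherits == "ALL" else self.inherits
            inherited = {c: parent_ctx[c] for c in names}
        return {**inherited, **self.self_context}


jan_prod = Construct("JAN_PROD", "product", {"month": "jan"})
s1_price = Construct("S1_PRICE", "price", {"supplier": "s1"},
                     inherits="ALL", parent=jan_prod)
pname = Construct("PNAME", "pname", parent=jan_prod)
```

With these objects, `s1_price.context()` combines the inherited month with the self context supplier, while `pname.context()` is empty, i.e., PNAME = pname.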

In the example below, we represent the entity types, relationship types and attrib- 
utes in Fig. 1 using the ontology SupOnto. 

Example 2. In Fig. 1, using the ontology SupOnto, the entity type JAN_PROD of
DB2 is represented as:

JAN_PROD = product[month='jan'].

That is, the context of JAN_PROD is month='jan'. This means that each entity of
JAN_PROD is a product supplied in Jan.

Also in DB2, given an attribute S1_PRICE of the entity type JAN_PROD, we represent
it as:

S1_PRICE = price[supplier='s1', inherit ALL].

That is, the self context of S1_PRICE is supplier='s1', and the inherited context
(from the entity type) is month='jan'. This means that each value of S1_PRICE of the
entity type JAN_PROD is a price of a product supplied by supplier s1 in the month of
Jan.

In DB3, given a relationship type JAN_SUP, we represent it as:

JAN_SUP = supply[month='jan'].

This means that each relationship of JAN_SUP is a supply relationship in the month
of Jan.

Also in DB3, given an attribute PRICE of the relationship type JAN_SUP, we represent
it as:

PRICE = price[inherit ALL].

PRICE inherits the context month='jan' from the relationship type. This means that
each value of PRICE of the relationship type JAN_SUP is a supplying price in Jan. □

In contrast to original ER schemas, we call an ER schema whose schema constructs
are represented using ontology symbols an elevated schema; the ER schemas
with the statements given in Fig. 1 are examples. The mapping from an ER schema onto
an elevated schema should be specified by users. Our work is based on elevated schemas.
Now we can define schematic discrepancy in general as follows.

Definition 2. Two elevated schemas are schematic discrepant , if metadata in one 
database correspond to attribute values or entities in the other. We call meta-attributes 
whose values correspond to attribute values or entities in other databases discrepant 
meta-attributes. □ 

For example, in Fig. 1, in DB2, month and supplier are discrepant meta-attributes 
as their values correspond to entities in DB1, so is the meta-attribute month in DB3. 
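
The detection implied by Definition 2 can be approximated mechanically once elevated schemas are available. The sketch below uses a deliberately simplified test, which is an assumption made here: a meta-attribute counts as discrepant when its ontology type is the type of some construct in another schema, without checking individual values. The dictionary encoding of schemas is likewise hypothetical.

```python
def discrepant_meta_attributes(schema, other_schemas):
    """Return meta-attributes of `schema` that are modeled as data elsewhere.

    Each schema maps a construct name to (ontology type, full context dict).
    """
    # Ontology types that appear as constructs (i.e., as data) in other schemas.
    data_types = {ont for other in other_schemas for ont, _ctx in other.values()}
    found = set()
    for _ont, ctx in schema.values():
        found |= {c for c in ctx if c in data_types}
    return found


db1 = {
    "PROD": ("product", {}), "SUPPLIER": ("supplier", {}),
    "MONTH": ("month", {}), "SUP": ("supply", {}), "PRICE": ("price", {}),
}
db2 = {
    "JAN_PROD": ("product", {"month": "jan"}),
    "S1_PRICE": ("price", {"supplier": "s1", "month": "jan"}),
}
db3 = {
    "JAN_SUP": ("supply", {"month": "jan"}),
    "PRICE": ("price", {"month": "jan"}),
}
```

On these toy schemas the function reports month and supplier for DB2 and month for DB3, matching the example above, and nothing for DB1.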

Before ending this section, we define the global identifier of a set of entity types.
In general, two entity types (or relationship types) E1 and E2 are similar, if
E1=T[Cnt1] and E2=T[Cnt2] with T an ontology type, and Cnt1 and Cnt2 two sets
(possibly empty sets) of meta-attributes with values. Intuitively, a global identifier
identifies the entities of similar entity types, independent of context.

Definition 3. Given a set of similar entity types E, let K be an identifier of each entity 
type in E. We call K a global identifier of the entity types of E, provided that if two 
entities of the entity types of E refer to the same real world object, then the values of 
K of the two entities are the same, and vice versa. □ 

For example, in Fig. 1, the PROD entity types of DB1 and DB3, and the entity
types JAN_PROD, ..., DEC_PROD of DB2 are similar entity types, for they all
correspond to the ontology type product without or with a context. Suppose P# is a
global identifier of these entity types, i.e., P# uniquely identifies products from all the
three databases. Similarly, we suppose S# is a global identifier of the SUPPLIER
entity types of DB1 and DB3.

In [13], Lee et al. propose an ER based federated database system where local
schemas modeled in the relational, object-relational, network or hierarchical models
are first translated into the corresponding ER export schemas before they are integrated.
Our approach extends theirs by using ontology to provide the semantics
necessary for schema integration. In general, local schemas could be in different data
models. We first translate them into ER or ORASS schemas (ORASS is an ER-like
model for semi-structured data [25]). Then we map the schema constructs of ER schemas
onto the types of ontology and get elevated schemas with the help of semi-automatic
tools. Finally, we integrate the elevated schemas using the semantics of ontology;
semantic heterogeneities among elevated schemas are resolved in this step. Integrity
constraints on the integrated schema are derived from the constraints on the elevated
schemas at the same time.

4 Resolving Schematic Discrepancies in the Integration of ER Schemas

In this section, we resolve schematic discrepancies in schema integration. In particular,
we present four algorithms to resolve schematic discrepancies for entity types,
relationship types, attributes of entity types and attributes of relationship types
respectively. This is done by transforming discrepant meta-attributes into entity types. The
transformations keep the cardinalities of attributes and entity types, and therefore
preserve the FDs and MVDs. Note that in the presence of context, the values of an
attribute depend not only on the identifier of an entity type/relationship type, but also on the
metadata of the attribute. To simplify the presentation, we only consider the discrepant
meta-attributes of entity types, relationship types and attributes, leaving the other
meta-attributes out as they will not change in schema transformation.

In the rest of this section, we first present Algorithms TRANS_ENT and
TRANS_REL, the resolutions of discrepancies for entity types and relationship types,
in Section 4.1, and then TRANS_ENT_ATTR and TRANS_REL_ATTR, the resolutions
for attributes of entity types and attributes of relationship types, in Section 4.2.
Examples are provided to illustrate each algorithm.



4.1 Resolving Schematic Discrepancies for Entity Types/Relationship Types 

In this sub-section, we first show how to resolve discrepancies for entity types using 
the schema of Fig. 1, then present Algorithm TRANS_ENT in general. Finally, we 
describe the resolution of discrepancies for relationship types by an example, omitting 
the general algorithm which is similar to TRANS_ENT. 

As an example to remove discrepancies for entity types, we transform the schema 
of DB2 in Fig. 1 below. 

Example 3 (Fig. 2). In Step 1, for each entity type of DB2, say JAN_PROD, we represent
the meta-attribute month as an entity type MONTH consisting of the only entity
jan that is the metadata of JAN_PROD. We change the entity type JAN_PROD into
PROD after removing the context, and construct a relationship type R to associate the
entities of PROD and the entity of MONTH. Then we handle the attributes of
JAN_PROD. As PNAME has nothing to do with the context month='jan' of the
entity type, it becomes an attribute of PROD. However, S1_PRICE, ..., SN_PRICE
inherit the context of month; they become the attributes of the relationship type R.
Then in Step 2, the corresponding entity types, relationship types and attributes are
merged respectively. The merged entity type MONTH consists of all the entities
{jan, ..., dec} of the original MONTH entity types; the same holds for the entity type
PROD, the relationship type R and their attributes. □

Then we give the general algorithm below. 

Algorithm TRANS_ENT 

Input: an elevated schema DB. 

Output: a schema DB' transformed from DB such that all the discrepant meta-attributes of 
entity types are transformed into entity types. 
Step 1: Resolve the discrepant meta-attributes of an entity type.

Let E = E_ont[C_1=c_1, ..., C_m=c_m] be an entity type of DB, for E_ont a type in the
ontology and discrepant meta-attributes C_1, ..., C_m with the values c_1, ..., c_m. Let K
be the global identifier of E.

Step 1.1: Transform discrepant meta-attributes C_1, ..., C_m into entity types.

    Construct an entity type E' = E_ont with the global identifier K. E' consists of the
    entities of E without any context.

    Construct entity types E_Ci = C_i with identifier K_Ci = C_i for each meta-attribute
    C_i ∈ {C_1, ..., C_m}. Each E_Ci contains one entity c_i.

    // Construct a relationship type to represent the associations among the entities of E
    // and the values of C_1, ..., C_m.

    Construct a relationship type R connecting the entity types E' and E_C1, ..., E_Cm.

Step 1.2: Handle the attributes of E.

    Let A be an attribute (not part of the identifier) of E, and selfCnt, a set of
    meta-attributes with values, be the self context of A.

    If A is a m:1 or m:m attribute, then

        Case 1: attribute A has nothing to do with the context of E. Then A becomes an
        attribute of E'.

        Case 2: attribute A = A_ont[selfCnt, inherit ALL] inherits all the context
        {C_1=c_1, ..., C_m=c_m} from E. Then A' = A_ont[selfCnt] becomes an attribute of R.

        Case 3: attribute A = A_ont[selfCnt, inherit S] inherits some discrepant
        meta-attributes S ⊂ {C_1, ..., C_m} with the values from E, S ≠ ∅. Then construct a
        relationship type R_A connecting E' and those E_Ci for each meta-attribute C_i ∈ S.
        Attribute A' = A_ont[selfCnt] becomes an attribute of R_A.

    Else // A is a 1:1 or 1:m attribute, i.e., the values of A determine the entities
         // of E in the context. In this case, A should be modeled as an entity type to
         // preserve the cardinality constraint. We keep the discrepant meta-attributes of A,
         // and delay the resolution to Algorithm TRANS_ENT_ATTR, the resolution for
         // attributes of entity types.

        Construct an attribute A' = A_ont[Cnt] of E', with Cnt the (self and inherited)
        context of A as the (self) context of A'.

Step 1.3: Handle relationship types involving entity type E in DB.

    Let R1 be a relationship type involving E in DB.

    Case 1: R1 has nothing to do with the context of E. Then replace E with E' in R1.

    Case 2: R1 inherits all the context {C_1=c_1, ..., C_m=c_m} from E. Then replace E
    with R (i.e., treat R as a high level entity type) in R1.

    Case 3: R1 inherits some discrepant meta-attributes S ⊂ {C_1, ..., C_m} with the
    values from E, S ≠ ∅. Then construct a relationship type R2 connecting E' and
    those E_Ci for each meta-attribute C_i ∈ S. Replace E with R2 in R1.

Step 2: Merge the entity types, relationship types and attributes respectively which
correspond to the same ontology type with the same context, and union their domains. □
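
At the instance level, the effect of Steps 1.1, 1.2 (Cases 1 and 2 only) and Step 2 can be sketched as follows. This is an illustration under assumptions made here, not the algorithm itself: all entity types share the same discrepant meta-attributes, every attribute is m:1, relationship types involving E (Step 1.3) are ignored, and the dictionary layout, product name and prices are hypothetical.

```python
def trans_ent(entity_types, inheriting):
    """Turn context-bearing entity types into E', the E_Ci and R (instance level)."""
    prod = {}          # E': entities keyed by global identifier K, context-free attributes
    ctx_entities = {}  # E_Ci: one entity set per discrepant meta-attribute C_i
    rel = {}           # R: (K, c_1, ..., c_m) -> attributes that inherit the context
    for etype in entity_types.values():
        ctx = etype["context"]
        for c, v in ctx.items():                       # Step 1.1
            ctx_entities.setdefault(c, set()).add(v)
        ctx_key = tuple(ctx[c] for c in sorted(ctx))
        for k, attrs in etype["entities"].items():
            e = prod.setdefault(k, {})                 # Step 2: shared dicts merge instances
            r = rel.setdefault((k,) + ctx_key, {})
            for a, val in attrs.items():               # Step 1.2, Cases 1 and 2
                (r if a in inheriting else e)[a] = val
    return prod, ctx_entities, rel


db2 = {
    "JAN_PROD": {"context": {"month": "jan"},
                 "entities": {"p1": {"PNAME": "bolt", "S1_PRICE": 10, "S2_PRICE": 12}}},
    "FEB_PROD": {"context": {"month": "feb"},
                 "entities": {"p1": {"PNAME": "bolt", "S1_PRICE": 11}}},
}
inheriting = {"S1_PRICE", "S2_PRICE"}   # attributes declared "inherit ALL"
prod, ctx_entities, rel = trans_ent(db2, inheriting)
```

After the call, PNAME sits on the merged PROD entity, MONTH holds jan and feb, and the S1_PRICE and S2_PRICE values hang off R, still carrying their own supplier context for a later pass.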

In the resolution of schematic discrepancies for relationship types, we should deal
with a set of entity types (participating in a relationship type) instead of individual
ones. The steps are similar to those of Algorithm TRANS_ENT, but without Step 1.3.

[Fig. 2: diagrams of the transformation of DB2 (Step 1, Step 2) not reproduced; the
accompanying statements are:]

DB2 (before):
JAN_PROD = product[month='jan']
    {P# = p#, PNAME = pname,
     S1_PRICE = price[supplier='s1', inherit ALL], ...}

DB2 (after):
PROD = product
MONTH = month
S1_PRICE = price[supplier='s1']

Fig. 2. Resolve schematic discrepancies for entity types

We omit the resolution algorithm TRANS_REL for lack of space, but explain it by
an example below, i.e., transforming the schema of DB3 in Fig. 1.

Example 4 (Fig. 3). In Step 1, for each relationship type of DB3, say JAN_SUP, we
represent the meta-attribute month as an entity type MONTH consisting of the only
entity jan that is the metadata of JAN_SUP. We change JAN_SUP into the relationship
type SUP after removing the context, and relate the entity type MONTH to SUP.
Attribute PRICE of JAN_SUP inherits the context month='jan' from the relationship
type, and therefore it becomes an attribute of SUP in the transformed schema. Then in
Step 2, the MONTH entity types are merged into one consisting of all the entities
{jan, ..., dec}; the SUP relationship types are also merged, and we get the schema of
DB1 in Fig. 1. □



4.2 Resolving Schematic Discrepancies for Attributes 

In this sub-section, we first show how to resolve discrepancies for attributes of entity
types using an example, then present Algorithm TRANS_ENT_ATTR in general.
Finally, we describe the resolution of discrepancies for attributes of relationship types
by an example, omitting the general algorithm which is similar to TRANS_ENT_ATTR.

Fig. 3. Resolve schematic discrepancies for relationship types

The following example shows how to resolve discrepancies for attributes of entity 
types. Note the discrepancies of entity types should be resolved before this step. 

Example 5 (Fig. 4). Suppose we have another database DB4 recording the supplying
information, in which all the suppliers and months are modeled as contexts of the
attributes in an entity type PROD. The transformation is given in Fig. 4. In Step 1, for
each attribute with discrepant meta-attributes, say S1_JAN_PRICE, the meta-attributes
supplier and month are represented as entity types SUPPLIER and MONTH
consisting of one entity s1 and jan respectively. A relationship type SUP is constructed
to connect PROD, MONTH and SUPPLIER. After removing the context, we
change S1_JAN_PRICE into PRICE, an attribute of the relationship type SUP. Then
in Step 2, we merge all the corresponding entity types, relationship types and attributes,
and get the schema of DB1 in Fig. 1. □

Fig. 4. Resolve schematic discrepancies for attributes of entity types

Then we give the general algorithm below. 

Algorithm TRANS_ENT_ATTR 

Input: an elevated schema DB. 

Output: a schema DB’ transformed from DB such that all the discrepant meta-attributes of 
attributes of entity types are transformed into entity types. 

Step 1: Resolve the discrepant meta-attributes of an attribute in an entity type.

Given an entity type E of DB, let A = A_ont[C_1=c_1, ..., C_m=c_m] be an attribute (not
part of the identifier) of E, for A_ont a type in the ontology, and C_1, ..., C_m the
discrepant meta-attributes with the values c_1, ..., c_m. // Note A has no inherited context,
which has been removed in Algorithm TRANS_ENT if any.

    // Represent the discrepant meta-attributes as entity types.

    Construct entity types E_Ci = C_i with identifiers K_Ci = C_i for each meta-attribute
    C_i ∈ {C_1, ..., C_m}. Each E_Ci contains one entity c_i.

    If A is a m:1 or m:m attribute, then

        // Construct a relationship type to represent the associations among the entities
        // of E and the values of C_1, ..., C_m.

        Construct a relationship type R connecting the entity types E and E_C1, ..., E_Cm.

        Attribute A' = A_ont becomes an attribute of R.

    Else // A is a 1:1 or 1:m attribute, i.e., the values of A determine the entities of E
         // in the context. A should be modeled as an entity type to preserve the
         // cardinality constraint.

        Construct E_A' = A_ont with the identifier A' = A_ont.

        Construct a relationship type R connecting the entity types E, E_A', and
        E_C1, ..., E_Cm.

        Represent the FD {A', C_1, ..., C_m} → K as the cardinality constraint on R.

        If A is a 1:1 attribute, also represent the FD {K, C_1, ..., C_m} → A' on R.

Step 2: Merge the entity types, relationship types and attributes respectively which
correspond to the same ontology type with the same context, and union their domains. □
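
The m:1 branch of the algorithm amounts to unpivoting context-bearing attributes into a relationship. The following instance-level sketch covers only that branch; the input layout, the `attr_meta` mapping from attribute names to (ontology type, context), and the sample values are assumptions made for illustration.

```python
def trans_ent_attr(entities, attr_meta):
    """Unpivot attributes with discrepant context into a relationship R (m:1 case)."""
    plain = {}          # attributes without discrepant context stay on E
    ctx_entities = {}   # E_Ci: one entity set per discrepant meta-attribute C_i
    rel = {}            # R: (K, c_1, ..., c_m) -> {A_ont: value}
    for k, attrs in entities.items():
        for a, val in attrs.items():
            if a not in attr_meta:
                plain.setdefault(k, {})[a] = val
                continue
            ont, ctx = attr_meta[a]
            for c, v in ctx.items():
                ctx_entities.setdefault(c, set()).add(v)
            # Meta-attributes are ordered by name so that keys are comparable.
            key = (k,) + tuple(ctx[c] for c in sorted(ctx))
            rel.setdefault(key, {})[ont] = val   # Step 2: same key, same instance
    return plain, ctx_entities, rel


db4_prod = {"p1": {"PNAME": "bolt", "S1_JAN_PRICE": 10,
                   "S2_JAN_PRICE": 12, "S1_FEB_PRICE": 11}}
attr_meta = {
    "S1_JAN_PRICE": ("price", {"supplier": "s1", "month": "jan"}),
    "S2_JAN_PRICE": ("price", {"supplier": "s2", "month": "jan"}),
    "S1_FEB_PRICE": ("price", {"supplier": "s1", "month": "feb"}),
}
plain, ctx_entities, rel = trans_ent_attr(db4_prod, attr_meta)
```

Because R is keyed by (K, month, supplier), each key carries at most one price, which is the instance-level counterpart of the FD {K, C_1, ..., C_m} → A' for m:1 attributes.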

The resolution of schematic discrepancies for the attributes of relationship types is 
similar to that for the attributes of entity types, as a relationship type could be treated 
as a high level entity type. We omit the resolution algorithm TRANS_REL_ATTR for 
lack of space, but explain it by an example below. 

Example 6 (Fig. 5). Given the transformed schema of Fig. 2, we transform the attributes
of the relationship type R as follows. In Step 1, for each attribute of R, say
S1_PRICE, we represent the meta-attribute supplier as an entity type SUPPLIER with
one entity s1, and construct a relationship type SUP to connect the relationship type R
and entity type SUPPLIER. After removing the context, we change S1_PRICE into
PRICE, an attribute of SUP. Then in Step 2, we merge the SUPPLIER entity types
and SUP relationship types respectively. In the merged schema, the relationship type R
is redundant as it is a projection of SUP and has no attributes. Consequently, we remove
R and get the schema of DB1 in Fig. 1. □




The transformations of the algorithms (in Sections 4.1 and 4.2) correctly preserve
the FDs/MVDs in the presence of context, as shown in the following proposition.

Proposition 1. Let E be a set of similar entity types (or relationship types) with the
same set of discrepant meta-attributes, and K be the global identifier of E (or a set of
global identifiers of entity types if E is a set of relationship types). Suppose each
entity type (or relationship type) of E has a set of attributes with the same cardinality:

𝒜 = { A | A = A_ont[C_1=c_1, ..., C_m=c_m, inherit C_{m+1}, ..., C_n], c_i ∈ dom(C_i) for 1 ≤ i ≤ m }.

Then in the transformed schema, C_1, ..., C_n are modeled as entity types, and the
following FDs/MVDs hold:

Case 1: 𝒜 are m:1 attributes. Then A_ont is modeled as an attribute A' = A_ont, and a
FD {K, C_1, ..., C_n} → A' holds.

Case 2: 𝒜 are m:m attributes. Then A_ont is modeled as an attribute A' = A_ont, and a
MVD {K, C_1, ..., C_n} ↠ A' holds.

Case 3: 𝒜 are 1:1 attributes. Then A_ont is modeled as an entity type with the identifier
A' = A_ont, and FDs {K, C_1, ..., C_n} → A' and {A', C_1, ..., C_n} → K hold.

Case 4: 𝒜 are 1:m attributes. Then A_ont is modeled as an entity type with the identifier
A' = A_ont, and a FD {A', C_1, ..., C_n} → K holds. □

For lack of space, we only prove Case 1 when E is a set of entity types. In a transformed
schema, consider two relationships with values on A': (k, c_1, ..., c_m, a) and
(k', c_1', ..., c_m', a'), for k and k' values (or value sets) of K, c_1, ..., c_m and
c_1', ..., c_m' values of C_1, ..., C_m, and a and a' values of A'. If k=k', c_1=c_1',
c_2=c_2', ..., c_m=c_m', then in the original schemas, the two relationships correspond
to the same entity and the same attribute, say A ∈ 𝒜. As A is a m:1 attribute, we have
a=a'. That is, the FD {K, C_1, ..., C_n} → A' holds in the transformed schema.

In schema integration, schematic discrepancies of different schema constructs 
should be resolved in order, i.e., first for entity types, then relationship types, finally 
attributes of entity types and attributes of relationship types. The resolutions for most 
of the other semantic heterogeneities (introduced in Section 1) follow the resolution 
of schematic discrepancies. 



5 Related Work 

Context is the key component in capturing the semantics related to the definition of an 
object or association. The definition of context as a set of meta-attributes with values 
is originally adopted in [7, 23], but is used to solve different kinds of semantic hetero- 
geneities. Our work complements rather than competes with theirs. Their work is 
based on the context at the attribute level only. We consider the contexts at different 
levels, and the inheritance of context. 

A special kind of schematic discrepancy has been studied in multidatabase interop- 
erability, e.g. [11, 12, 16, 17, 19], and [2]. They dealt with the discrepancy when 
schema labels (e.g., relation names or attribute names) in one database correspond to 
attribute values in another. However, we use contexts to capture meta-information, 
and solve a more general problem in the sense that schema constructs could have 
multiple (instead of single) discrepant meta-attributes. Furthermore, their works are at 
the “structure level”, i.e., they did not consider the constraint issue in the resolution of 
schematic discrepancies. However, the importance of constraints can never be overes- 
timated in both individual and multidatabase systems. In particular, we preserve FDs 
and MVDs during schema transformation, which are expressed as cardinality con- 
straints in ER schemas. 

The purposes are also different. Previous works focused on the development of a
multidatabase language by which users can query across schematically discrepant
databases. In contrast, we try to develop an integration system which can detect and resolve
schematic discrepancies automatically given the meta-information on source schemas.

The issue of inferring view dependencies was introduced in [1, 8]. However, their 
works are based on the views defined using relational algebra. In other words, they 
did not solve the inference problem in the transformations between schematic dis- 
crepant schemas. In [14, 21, 24], people have begun to focus on the derivation of 
constraints for integrated schemas from constraints of component schemas in schema 
integration. However, these works failed to consider schematic discrepancy in schema 
integration. Our work complements theirs. 

6 Conclusions and Future Work

Information integration provides a competitive advantage to businesses, and has become
a major area of investment by software companies today [18]. In this paper, we resolve
a common problem in schema integration, schematic discrepancy in general,
using the paradigm of context. We define context as a set of meta-attributes with 
values, which could be at the levels of databases, entity types, relationship types, and 
attributes. We design algorithms to resolve schematic discrepancies by transforming 
discrepant meta-attributes into entity types. The transformations preserve information 
and cardinality constraints which are useful in verifying lossless schema transforma- 
tion, schema normalization and query processing in multidatabase systems. 

We have implemented a schema integration tool to semi-automatically integrate
schematically discrepant schemas from several relational databases. Next, we will try to
extend our system to integrate databases in different models and semi-structured data.



Abstract. In this paper, we propose a new similarity measure between
vague sets and apply vague logic in a relational database environment 
with the objective of capturing the vagueness of the data. By introduc- 
ing a new vague Similar Equality ( Seq ) for comparing data values, we 
first generalize the classical Functional Dependencies (FDs) into Vague 
Functional Dependencies (VFDs). We then present a set of sound and 
complete inference rules. Finally, we study the validation process of VFDs 
by examining the satisfaction degree of VFDs, and the merge-union and 
merge-intersection on vague relations. 



1 Introduction 

The relational data model [8] has been extensively studied for over three decades. 
This data model basically handles precise and exact data in an information 
source. However, many real life applications such as merging data from many 
sources involve imprecise and inexact data. It is well known that Fuzzy database 
models [11, 2], based on the fuzzy set theory by Zadeh [13], have been introduced 
to handle inexact and imprecise data. In [5], Gau et al. point out that the
drawback of using the single membership value in fuzzy set theory is that the
evidence for $u \in U$ and the evidence against $u \in U$ are in fact mixed together.
(Here U is a classical set of objects, called the universe of discourse. An element
of U is denoted by u.) Therefore, they propose vague sets, which are similar to the
intuitionistic fuzzy sets proposed in [1]. A true membership function $\alpha_V(u)$
and a false membership function $\beta_V(u)$ are used to characterize the lower bound
on $\mu_F(u)$. (Here V means a vague set and F means a fuzzy set.) The lower
bounds are used to create a subinterval $[\alpha_V(u), 1 - \beta_V(u)]$ of the unit interval
[0,1], where $0 \le \alpha_V(u) \le \mu_F(u) \le 1 - \beta_V(u) \le 1$, in order to generalize the
membership function of fuzzy sets.

There have been many studies which discuss the topic concerning how to 
measure the degree of similarity or distance between vague sets or intuitionistic 
fuzzy sets [3, 4, 7, 9, 12, 6]. However, the proposed methods have some limitations. 

* This work is supported in part by grants from the Research Grant Council of Hong 
Kong, Grant Nos HKUST6185/02E and HKUST6165/03E. 
For example, Hong’s similarity measure in [7] means that the similarity measure
between the vague value with the most imprecise evidence (the precision of the 
evidence is 0) and the vague value with the most precise evidence (the precision 
of the evidence is 1) is equal to 0.5. In this case, the similarity measure should 
be equal to 0. Our view is that the similarity measure should include two factors 
of vague values. One is the difference between the evidences contained by the 
vague values; another is the difference between the precisions of the evidences. 
However, the proposed measures or distances consider only one factor (e.g. in 
[3,4]) or do not combine both the factors appropriately (e.g. in [7,9, 12,6]). Our 
new similarity measure is able to return a more reasonable answer. 

In this paper, we extend the classical relational data model to deal with 
vague information. Our first objective is to extend relational databases to in- 
clude vague domains by suitably defining the Vague Functional Dependencies 
( VFDs ) based on our notion of similarity measure. A set of sound and complete 
inference rules for VFDs is then established. We discuss the satisfaction degree 
of VFDs and apply VFDs in merged vague relations as the second objective. 
The main contributions of the paper are as follows: (1) A new similarity measure 
between vague sets is proposed to remedy some problems for similar definitions 
in literature. We argue that our measure gives a more reasonable estimation; (2) 
A VFD is proposed in order to capture more semantics in vague relations; (3) 
The satisfaction degree of VFDs in merged vague relations is studied. 

The rest of the paper is organized as follows. Section 2 presents some basic 
concepts related to databases and the vague set theory. In Section 3, we propose 
a new similarity measure between vague sets. In Section 4, we introduce the 
concept of a Vague Functional Dependency (VFD) and the associated inference 
rules. We then explain the validation process which determines the satisfaction 
degree of VFDs in vague relations. In Section 5, we give the definitions of merge 
operators of vague relations and discuss the satisfaction degree of VFDs after 
merging. Section 6 concludes the paper. 

2 Preliminaries 

In this section, some basic concepts related to the classical relational data model 
and the vague set theory are given. 

2.1 Relational Data Model 

We assume the readers are familiar with the basic concepts of the relational data
model [8]. There are two operations on relations that are particularly relevant in
subsequent discussion: projection and natural join. The projection of a relation
r of R(XYZ) over the set of attributes X is obtained by taking the restriction
of the tuples of r to the attributes in X and eliminating duplicate tuples in
what remains. This operation is denoted by $\pi_X(r) = \{t[X] \mid t \in r\}$. Let $r_1$
and $r_2$ be two relations of R(XY) and R(YZ), respectively. The natural join
$r_1 \bowtie r_2$ is a relation over R(XYZ) defined by
$r = r_1 \bowtie r_2 = \{t \mid t[XY] \in r_1 \text{ and } t[YZ] \in r_2\}$.
Functional Dependencies (FDs) are important integrity
constraints in relational databases. An FD is a statement $X \to Y$, where X
and Y are sets of attributes. A relation r satisfies the FD if, for all $t_p$ and $t_q$ in
r, $t_p[X] = t_q[X]$ implies $t_p[Y] = t_q[Y]$.
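
The three notions recalled above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: relations are represented as lists of dictionaries, and the function names `project`, `natural_join` and `satisfies_fd` as well as the sample relation are hypothetical, not part of the paper.

```python
from itertools import combinations

def project(r, attrs):
    """Projection: restrict every tuple to attrs and drop duplicates."""
    result = []
    for t in r:
        restricted = {a: t[a] for a in attrs}
        if restricted not in result:
            result.append(restricted)
    return result

def natural_join(r1, r2):
    """Natural join on the attributes that r1 and r2 have in common."""
    if not r1 or not r2:
        return []
    common = set(r1[0]) & set(r2[0])
    return [{**t1, **t2} for t1 in r1 for t2 in r2
            if all(t1[a] == t2[a] for a in common)]

def satisfies_fd(r, lhs, rhs):
    """True if t_p[lhs] = t_q[lhs] implies t_p[rhs] = t_q[rhs] for all pairs."""
    for tp, tq in combinations(r, 2):
        same_lhs = all(tp[a] == tq[a] for a in lhs)
        same_rhs = all(tp[a] == tq[a] for a in rhs)
        if same_lhs and not same_rhs:
            return False
    return True

# Hypothetical sample relation over R(XYZ).
r = [
    {"X": 1, "Y": "a", "Z": 10},
    {"X": 1, "Y": "a", "Z": 20},
    {"X": 2, "Y": "b", "Z": 10},
]
xy = project(r, ["X", "Y"])
yz = project(r, ["Y", "Z"])
joined = natural_join(xy, yz)
```

In this sample, X determines Y but not Z, and joining the two projections happens to give back the three original tuples.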

2.2 Vague Data Model 

Let U be a classical set of objects, called the universe of discourse, where an 
element of U is denoted by u. 

Definition 1. (Vague Set) A vague set V in a universe of discourse U is
characterized by a true membership function, $\alpha_V$, and a false membership
function, $\beta_V$, as follows: $\alpha_V : U \to [0,1]$, $\beta_V : U \to [0,1]$, and $\alpha_V(u) + \beta_V(u) \le 1$,
where $\alpha_V(u)$ is a lower bound on the grade of membership of u derived from the
evidence for u, and $\beta_V(u)$ is a lower bound on the grade of membership of the
negation of u derived from the evidence against u.

Suppose $U = \{u_1, u_2, \ldots, u_n\}$. A vague set V of the universe of discourse
U can be represented by $V = \sum_{i=1}^{n} [\alpha(u_i), 1 - \beta(u_i)]/u_i$, where
$0 \le \alpha(u_i) \le 1 - \beta(u_i) \le 1$ and $1 \le i \le n$.

This approach bounds the grade of membership of u to a subinterval
$[\alpha_V(u), 1 - \beta_V(u)]$ of [0,1]. In other words, the exact grade of membership $\mu_V(u)$ of
u may be unknown, but is bounded by $\alpha_V(u) \le \mu_V(u) \le 1 - \beta_V(u)$, where
$\alpha_V(u) + \beta_V(u) \le 1$. We depict these ideas in Fig. 1. Throughout this paper, we
simply write $\alpha$ and $\beta$ when there is no ambiguity about V.




Fig. 1. The true ($\alpha$) and false ($\beta$) membership functions of a vague set



For a vague set $[\alpha(u), 1 - \beta(u)]/u$, we say that the interval $[\alpha(u), 1 - \beta(u)]$
is the vague value to the object u. For example, if $[\alpha(u), 1 - \beta(u)] = [0.6, 0.9]$,
then we can see that $\alpha(u) = 0.6$, $1 - \beta(u) = 0.9$ and $\beta(u) = 0.1$. It is interpreted
as “the degree that object u belongs to the vague set V is 0.6, the degree that
object u does not belong to the vague set V is 0.1.” In a voting process, the
vague value [0.6, 0.9] can be interpreted as “the vote for resolution is 6 in favor,
1 against, and 3 neutral (abstentions).”

The precision of the knowledge about u is characterized by the difference
$(1 - \beta(u) - \alpha(u))$. If this is small, the knowledge about u is relatively precise;
if it is large, we know correspondingly little. If $(1 - \beta(u))$ is equal to $\alpha(u)$, the
knowledge about u is exact, and the vague set theory reverts back to fuzzy set
theory. If $(1 - \beta(u))$ and $\alpha(u)$ are both equal to 1 or 0, depending on whether u
does or does not belong to V, the knowledge about u is very exact and the theory
reverts back to ordinary sets. Thus, any crisp or fuzzy value can be regarded as a
special case of a vague value. For example, the ordinary set {u} can be presented
as the vague set [1, 1]/u, while the fuzzy set 0.8/u (the membership of u is 0.8)
can be presented as the vague set [0.8, 0.8]/u.

Definition 2. (Empty Vague Set) A vague set V is an empty vague set, if
and only if, its true membership function $\alpha = 0$ and false membership function
$\beta = 1$ for all u. We use $\emptyset$ to denote it.

Definition 3. (Complement) The complement of a vague set V is denoted by
V' and is defined by $\alpha_{V'}(u) = \beta_V(u)$, and $1 - \beta_{V'}(u) = 1 - \alpha_V(u)$.

Definition 4. (Containment) A vague set A is contained in another vague
set B, $A \subseteq B$, if and only if, $\alpha_A(u) \le \alpha_B(u)$, and $1 - \beta_A(u) \le 1 - \beta_B(u)$.

Definition 5. (Equality) Two vague sets A and B are equal, written as A = B,
if and only if, $A \subseteq B$ and $B \subseteq A$; that is, $\alpha_A(u) = \alpha_B(u)$, and
$1 - \beta_A(u) = 1 - \beta_B(u)$.

Definition 6. (Union) The union of two vague sets A and B is a vague set C,
written as $C = A \cup B$, whose true membership and false membership functions
are related to those of A and B by $\alpha_C(u) = \max(\alpha_A(u), \alpha_B(u))$, and
$1 - \beta_C(u) = \max(1 - \beta_A(u), 1 - \beta_B(u)) = 1 - \min(\beta_A(u), \beta_B(u))$.

Definition 7. (Intersection) The intersection of two vague sets A and B is a
vague set C, written as $C = A \cap B$, whose true membership and false membership
functions are related to those of A and B by $\alpha_C(u) = \min(\alpha_A(u), \alpha_B(u))$, and
$1 - \beta_C(u) = \min(1 - \beta_A(u), 1 - \beta_B(u)) = 1 - \max(\beta_A(u), \beta_B(u))$.

Definition 8. (Cartesian Product) Let $U = U_1 \times U_2 \times \cdots \times U_m$ be the
Cartesian product of m universes, and $A_1, A_2, \ldots, A_m$ be the vague sets in their
corresponding universes of discourse $U_1, U_2, \ldots, U_m$, respectively, with $u_i \in U_i$,
$i = 1, \ldots, m$. The Cartesian product $A = A_1 \times A_2 \times \cdots \times A_m$ is defined to be a
vague set of $U = U_1 \times U_2 \times \cdots \times U_m$, where the memberships are defined as
follows: $\alpha_A(u_1 \cdots u_m) = \min\{\alpha_{A_1}(u_1), \ldots, \alpha_{A_m}(u_m)\}$, and
$1 - \beta_A(u_1 \cdots u_m) = \min\{(1 - \beta_{A_1}(u_1)), \ldots, (1 - \beta_{A_m}(u_m))\} = 1 - \max\{\beta_{A_1}(u_1), \ldots, \beta_{A_m}(u_m)\}$.
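
The set operations of Definitions 3 to 7 can be sketched as follows. The representation is our own assumption, not the paper's: a vague set is a dictionary mapping each element u to its vague value as a pair `(alpha, 1 - beta)`, and an element that is not listed is treated as having the vague value [0, 0], which is the value every element has in the empty vague set of Definition 2. All function names are hypothetical.

```python
ABSENT = (0.0, 0.0)  # assumed vague value [alpha, 1 - beta] of an unlisted element

def value(V, u):
    """Vague value of u in V, falling back to [0, 0] for unlisted elements."""
    return V.get(u, ABSENT)

def complement(V, universe):
    """Definition 3: alpha' = beta and 1 - beta' = 1 - alpha."""
    return {u: (1.0 - value(V, u)[1], 1.0 - value(V, u)[0]) for u in universe}

def contained_in(A, B):
    """Definition 4: both interval bounds of A are at most those of B."""
    return all(value(A, u)[0] <= value(B, u)[0] and
               value(A, u)[1] <= value(B, u)[1]
               for u in set(A) | set(B))

def union(A, B):
    """Definition 6: element-wise maximum of both interval bounds."""
    return {u: (max(value(A, u)[0], value(B, u)[0]),
                max(value(A, u)[1], value(B, u)[1]))
            for u in set(A) | set(B)}

def intersection(A, B):
    """Definition 7: element-wise minimum of both interval bounds."""
    return {u: (min(value(A, u)[0], value(B, u)[0]),
                min(value(A, u)[1], value(B, u)[1]))
            for u in set(A) | set(B)}

# Hypothetical vague sets over the universe {10, 15}.
A = {10: (0.5, 0.75)}
B = {10: (1.0, 1.0), 15: (0.5, 0.75)}
```

With these sample sets, A is contained in B, so the union equals B and the intersection carries the vague value of A on 10 and [0, 0] on 15.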



2.3 Vague Relations 

Definition 9. (Vague Relation) A vague relation r on a relation scheme
$R = \{A_1, A_2, \ldots, A_m\}$ is a vague subset of $Dom(A_1) \times Dom(A_2) \times \cdots \times Dom(A_m)$. A
tuple $t = (a_1, a_2, \ldots, a_m)$ in $Dom(A_1) \times Dom(A_2) \times \cdots \times Dom(A_m)$ is a vague
subset of $U = U_1 \times U_2 \times \cdots \times U_m$.

A relation scheme R is denoted by $R(A_1, A_2, \ldots, A_m)$ or simply by R if
the attributes are understood. Corresponding to each attribute name $A_i$,
$1 \le i \le m$, the domain of $A_i$ is written as $Dom(A_i)$. However, unlike classical and
fuzzy relations, in vague relations, we define $Dom(A_i)$ as a set of vague sets.
Vague relations may be considered as an extension of classical relations and
fuzzy relations, which can capture more information about imprecision.

Example 1. Consider the vague relation r over Product(ID, Weight, Price) given
in Table 1. In r, Weight and Price are vague attributes. To keep the attribute
ID simple, we express it as an ordinary value. The first tuple in r means the
product with ID = 1 has the weight of [1,1]/10 and the price of [0.4,0.6]/50 +
[1,1]/80, which are vague sets. In the vague set [1,1]/10, [1,1] means the evidence
in favor of “the weight is 10” is 1 and the evidence against it is 0.



Table 1. A Product Relation r

ID   Weight                                Price
1    [1,1]/10                              [0.4,0.6]/50+[1,1]/80
2    [1,1]/20                              [1,1]/100+[0.6,0.8]/150
3    [1,1]/20                              [1,1]/100+[0.6,0.8]/150
4    [1,1]/10+[0.6,0.8]/15                 [1,1]/80+[0.6,0.8]/100
5    [0.6,0.8]/10+[1,1]/15+[0.6,0.8]/20    [0.6,0.8]/60+[1,1]/90



3 Similarity Measure Between Vague Sets 

In this section, we review the notions of similarity measures between vague sets 
proposed by Chen [3,4], Hong [7] and Li [9], together with distances between 
intuitionistic fuzzy sets proposed by Szmidt [12] and Grzegorzewski [6]. We show 
by some examples that these measures are not able to reflect our intuitions. A 
new similarity measure between vague sets is proposed to remedy the limitations. 

3.1 Similarity Measure Between Two Vague Values 

Let x and y be two vague values to a certain object such that $x = [\alpha_x, 1 - \beta_x]$,
$y = [\alpha_y, 1 - \beta_y]$. In general, there are two factors that should be considered in
measuring the similarity between two vague values. One is the difference between
the differences of the true and false membership values, which is given by
$D_d = |(\alpha_x - \beta_x) - (\alpha_y - \beta_y)|/2 = |(\alpha_x - \alpha_y) - (\beta_x - \beta_y)|/2$, such that $0 \le D_d \le 1$; another
is the difference between the sums of the true and false membership values, which
is given by $D_s = |(\alpha_x + \beta_x) - (\alpha_y + \beta_y)| = |(\alpha_x - \alpha_y) + (\beta_x - \beta_y)|$, such that
$0 \le D_s \le 1$. The first factor reflects the difference between the evidences contained
by the vague values, and the second factor reflects the difference between the
precisions of the evidences.

In [3,4], Chen defines a similarity measure between two vague values x and
y as follows:

$$M_C(x, y) = 1 - \frac{|(\alpha_x - \alpha_y) - (\beta_x - \beta_y)|}{2}, \qquad (1)$$


264 An Lu and Wilfred Ng 



which is equal to $(1 - D_d)$. This similarity measure ignores the difference between
the precisions of the evidences ($D_s$). For example, consider $x = [0, 1]$, $y = [a, 1 - a]$,
$0 \le a \le 0.5$:

$$M_C(x, y) = 1 - \frac{|(0 - a) - (0 - a)|}{2} = 1. \qquad (2)$$

This means that x and y are equal. On the one hand, x = [0, 1] means $\alpha_x = 0$
and $\beta_x = 0$, that is to say, we have no information about the evidence, and the
precision of the evidence is zero. On the other hand, $y = [a, 1 - a]$ means $\alpha_y = a$
and $\beta_y = a$, that is to say, we have some information about the evidence, and the
precision of the evidence is not zero. So it is not intuitive to have the similarity
measure of x and y being equal to 1.

In order to solve this problem, Hong et al. [7] propose another similarity
measure between vague values as follows:

$$M_H(x, y) = 1 - \frac{|\alpha_x - \alpha_y| + |\beta_x - \beta_y|}{2}. \qquad (3)$$


However, this definition also has some problems. Here is an example. 

Example 2. The similarity measure between [0,1] and [a, a], 0 < a < 1 is equal 
to 0.5. This means that the similarity measure between the vague value with 
the most imprecise evidence (the precision of the evidence is equal to zero) and 
the vague value with the most precise evidence (the precision of the evidence is 
equal to one) is equal to 0.5. However, our intuition shows that the similarity 
measure in this case should be equal to 0. 

Li et al. in [9] also give a similarity measure in order to remedy the problems
in Chen’s definition as follows:

$$M_L(x, y) = 1 - \frac{|(\alpha_x - \alpha_y) - (\beta_x - \beta_y)| + |\alpha_x - \alpha_y| + |\beta_x - \beta_y|}{4}. \qquad (4)$$

It can be checked that $M_L(x, y) = (M_C(x, y) + M_H(x, y))/2$. This means Li’s
similarity measure is just the arithmetic mean of Chen’s and Hong’s. So Li’s
similarity measure still contains the same problems.

[12, 6] adopt the Hamming distance and the Euclidean distance to measure the
distances between intuitionistic fuzzy sets as follows:

1. Hamming distance is given by

$$D_H(x, y) = \frac{|\alpha_x - \alpha_y| + |\beta_x - \beta_y| + |(\alpha_x - \alpha_y) + (\beta_x - \beta_y)|}{2}, \qquad (5)$$

2. Euclidean distance is given by

$$D_E(x, y) = \sqrt{\frac{(\alpha_x - \alpha_y)^2 + (\beta_x - \beta_y)^2 + ((\alpha_x - \alpha_y) + (\beta_x - \beta_y))^2}{2}}. \qquad (6)$$


These methods also have some problems. Here is an example. 

Example 3. We still consider the vague values x, $y_1$ and $y_2$ in Example 2. For the
Hamming distance, it can be calculated that $D_H(x, y_1) = D_H(x, y_2) = 0.6$. This
means that the Hamming distance between x and $y_1$ is equal to that between
x and $y_2$. In a voting process, as mentioned in Example 2, since both x and
$y_2$ have identical votes in favor and against, the Hamming distance between x
and $y_2$ should be less than that between x and $y_1$. For the Euclidean distance,
consider the Euclidean distance between [0,1] and [a, a], 0 < a < 1, which is
equal to $\sqrt{a^2 - a + 1}$. This means that the distance between the vague value
with the most imprecise evidence and the vague value with the most precise
evidence is not equal to 1. (Actually, the Euclidean distance in this case is in
the interval $[\frac{\sqrt{3}}{2}, 1)$.) However, our intuition shows that the distance in this case
should always be equal to 1.

In order to solve all the problems mentioned above, we define a new similarity 
measure between the vague values x and y as follows: 

Definition 10. (Similarity Measure Between Two Vague Values)

$$M(x, y) = \sqrt{(1 - D_d)(1 - D_s)}
= \sqrt{\left(1 - \frac{|(\alpha_x - \alpha_y) - (\beta_x - \beta_y)|}{2}\right)\left(1 - |(\alpha_x - \alpha_y) + (\beta_x - \beta_y)|\right)}. \qquad (7)$$

Furthermore, we define a distance between the vague values x and y as
$D(x, y) = 1 - M(x, y)$.
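
To make the comparison concrete, the measures in Equations (1), (3), (4), (5), (6) and (7) can be evaluated side by side. The sketch below is ours: a vague value is passed as a pair `(alpha, 1 - beta)`, the function names are hypothetical, and the product under the square root is clamped at zero only to guard against floating-point rounding.

```python
import math

def split(v):
    """Return (alpha, beta) for a vague value given as (alpha, 1 - beta)."""
    return v[0], 1.0 - v[1]

def m_chen(x, y):
    (ax, bx), (ay, by) = split(x), split(y)
    return 1.0 - abs((ax - ay) - (bx - by)) / 2.0

def m_hong(x, y):
    (ax, bx), (ay, by) = split(x), split(y)
    return 1.0 - (abs(ax - ay) + abs(bx - by)) / 2.0

def m_li(x, y):
    (ax, bx), (ay, by) = split(x), split(y)
    return 1.0 - (abs((ax - ay) - (bx - by)) + abs(ax - ay) + abs(bx - by)) / 4.0

def d_hamming(x, y):
    (ax, bx), (ay, by) = split(x), split(y)
    return (abs(ax - ay) + abs(bx - by) + abs((ax - ay) + (bx - by))) / 2.0

def d_euclid(x, y):
    (ax, bx), (ay, by) = split(x), split(y)
    return math.sqrt(((ax - ay) ** 2 + (bx - by) ** 2
                      + ((ax - ay) + (bx - by)) ** 2) / 2.0)

def m_new(x, y):
    """Definition 10: sqrt((1 - D_d)(1 - D_s))."""
    (ax, bx), (ay, by) = split(x), split(y)
    d_d = abs((ax - ay) - (bx - by)) / 2.0
    d_s = abs((ax - ay) + (bx - by))
    return math.sqrt(max(0.0, (1.0 - d_d) * (1.0 - d_s)))

no_info = (0.0, 1.0)    # [0, 1]: most imprecise evidence
exact = (0.25, 0.25)    # [a, a] with a = 0.25: most precise evidence
balanced = (0.25, 0.75)  # [a, 1 - a] with a = 0.25
```

For these inputs, `m_chen(no_info, balanced)` evaluates to 1 and `m_hong(no_info, exact)` to 0.5, which are the two counter-intuitive results discussed above, while `m_new(no_info, exact)` evaluates to 0.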



The similarity measure given in Definition 10 takes into account both the
difference between the evidences contained by the vague values and the difference
between the precisions of the evidences. Here is an example.

Example 4. We still consider the vague values x, $y_1$ and $y_2$ in Example 2. It can
be calculated that $M(x, y_1) = 0.53$, $M(x, y_2) = 0.63$. So $M(x, y_1) < M(x, y_2)$.
This means that the similarity measure between x and $y_1$ is less than that
between x and $y_2$. As mentioned in Example 2, this result accords with our
intuition. Another example is the similarity measure between [0, 1] and [a, a],
$0 \le a \le 1$, which is equal to 0. This means that the similarity measure between the
vague value with the most imprecise evidence and the vague value with the most
precise evidence is equal to 0. This result also accords with our intuition.

From Definition 10, we can obtain the following theorem. 

Theorem 1. The following statements are true:

1. The similarity measure is bounded, i.e., $0 \le M(x, y) \le 1$;

2. $M(x, y) = 1$, if and only if, the vague values x and y are equal (i.e., x = y);

3. $M(x, y) = 0$, if and only if, the vague values x and y are [0,0] and [1,1] or
[0,1] and [a, a], $0 \le a \le 1$;

4. The similarity measure is commutative, i.e., $M(x, y) = M(y, x)$.

3.2 Similarity Measure Between Two Vague Sets

We generalize the similarity measure to two given vague sets.

Definition 11. (Similarity Measure Between Two Vague Sets) Let X
and Y be two vague sets, where $X = \sum_{i=1}^{n} [\alpha_X(u_i), 1 - \beta_X(u_i)]/u_i$, and
$Y = \sum_{i=1}^{n} [\alpha_Y(u_i), 1 - \beta_Y(u_i)]/u_i$. The similarity measure between the vague sets X
and Y can be evaluated as follows:

$$M(X, Y) = \frac{1}{n} \sum_{k=1}^{n} M([\alpha_X(u_k), 1 - \beta_X(u_k)], [\alpha_Y(u_k), 1 - \beta_Y(u_k)])$$
$$= \frac{1}{n} \sum_{k=1}^{n} \sqrt{\left(1 - \frac{|(\alpha_X(u_k) - \alpha_Y(u_k)) - (\beta_X(u_k) - \beta_Y(u_k))|}{2}\right)\left(1 - |(\alpha_X(u_k) - \alpha_Y(u_k)) + (\beta_X(u_k) - \beta_Y(u_k))|\right)}. \qquad (8)$$

Similarly, we give the definition of distance between two vague sets as
$D(X, Y) = 1 - M(X, Y)$.

From Definition 11, we obtain the following theorem for vague sets, which is 
similar to Theorem 1. 

Theorem 2. The following statements related to M(X, Y) are true:

1. The similarity measure is bounded, i.e., $0 \le M(X, Y) \le 1$;

2. $M(X, Y) = 1$, if and only if, the vague sets X and Y are equal (i.e., X = Y);

3. $M(X, Y) = 0$, if and only if, all the vague values $[\alpha_X(u_k), 1 - \beta_X(u_k)]$ and
$[\alpha_Y(u_k), 1 - \beta_Y(u_k)]$ are [0,0] and [1,1] or [0,1] and [a, a], $0 \le a \le 1$;

4. The similarity measure is commutative, i.e., $M(X, Y) = M(Y, X)$.

4 Vague Functional Dependencies and Inference Rules 

In this section, we first give the definition of Similar Equality ( Seq ) of vague 
relations, which can be used to compare vague relations. Then we present the 
definition of a Vague Functional Dependency ( VFD ). Next, we present a set of 
sound and complete inference rules for VFDs, which is an analogy to Armstrong’s 
Axiom for classical FDs. 

4.1 Similar Equality of Vague Relations 

Similar Equality (Seq) of vague relations defined below can be used as a vague 
similarity measure to compare elements of a given domain. Suppose t p and t q 
are any two tuples in a relation r over the scheme R. 

Definition 12. (Similar Equality of Tuples) The Similar Equality of two
vague tuples $t_p$ and $t_q$ on the attribute $A_i$ in a vague relation is given by:

$$SEQ(t_p[A_i], t_q[A_i]) = \frac{1}{n} \sum_{k=1}^{n} \sqrt{\left(1 - \frac{|(\alpha_{t_p[A_i]}(u_k) - \alpha_{t_q[A_i]}(u_k)) - (\beta_{t_p[A_i]}(u_k) - \beta_{t_q[A_i]}(u_k))|}{2}\right)\left(1 - |(\alpha_{t_p[A_i]}(u_k) - \alpha_{t_q[A_i]}(u_k)) + (\beta_{t_p[A_i]}(u_k) - \beta_{t_q[A_i]}(u_k))|\right)}. \qquad (9)$$

The Similar Equality of two vague tuples $t_p$ and $t_q$ on attributes $X = \{A_1, \ldots, A_n\}$
($X \subseteq R$) in a vague relation is given by:

$$SEQ(t_p[X], t_q[X]) = SEQ(t_p[A_1 \cdots A_n], t_q[A_1 \cdots A_n]) = \min\{SEQ(t_p[A_1], t_q[A_1]), \ldots, SEQ(t_p[A_n], t_q[A_n])\}. \qquad (10)$$

From Definition 12 and Theorem 2, we have the following theorem.

Theorem 3. The following statements of the properties of $SEQ(t_p[X], t_q[X])$
are true:

1. The similar equality is bounded: $0 \le SEQ(t_p[X], t_q[X]) \le 1$;

2. $SEQ(t_p[X], t_q[X]) = 1$, if and only if all vague sets $t_p[A_s]$ and $t_q[A_s]$
($A_s \in X$) are equal (i.e., $t_p[A_s] = t_q[A_s]$ for every $A_s \in X$);

3. $SEQ(t_p[X], t_q[X]) = 0$, if and only if $\exists A_i \in X$, $SEQ(t_p[A_i], t_q[A_i]) = 0$,
if and only if $\exists A_i \in X$, all the vague values $[\alpha_{t_p[A_i]}(u_k), 1 - \beta_{t_p[A_i]}(u_k)]$
and $[\alpha_{t_q[A_i]}(u_k), 1 - \beta_{t_q[A_i]}(u_k)]$ are [0,0] and [1,1], or [0,1] and [a, a], where
$0 \le a \le 1$;

4. The similar equality is commutative: $SEQ(t_p[X], t_q[X]) = SEQ(t_q[X], t_p[X])$.

4.2 Vague Functional Dependencies 

Informally, a VFD captures the semantics of the fact that, for given two tuples, 
Y values should not be less similar than X values. We now give the following 
definition of a VFD. 

Definition 13. (Vague Functional Dependency) Given a relation r over
a relation schema $R(A_1, A_2, \ldots, A_m)$, where $Dom(A_i)$ ($i = 1, \ldots, m$) are sets
of vague sets, a Vague Functional Dependency (VFD) $X \hookrightarrow Y$, where $X, Y \subseteq R$,
holds over r, if for all tuples $t_p$ and $t_q$ in r, we have
$SEQ(t_p[X], t_q[X]) \le SEQ(t_p[Y], t_q[Y])$.

In the database literature [8] , a set of inference rules is generally used to derive 
new data dependencies from the given set of dependencies. We now present a set 
of sound and complete inference rules for VFDs, which is similar to Armstrong’s 
Axiom for FDs. 

Definition 14. (Inference Rules) Let us consider a relation scheme
$R(A_1, A_2, \ldots, A_m)$ and a set of VFDs F. Let X, Y, and Z be subsets of the relation
scheme R. We define a set of inference rules as follows:

1. Reflexivity: If $Y \subseteq X$, then $X \hookrightarrow Y$;

2. Augmentation: If $X \hookrightarrow Y$ holds, then $XZ \hookrightarrow YZ$ also holds;

3. Transitivity: If $X \hookrightarrow Y$ and $Y \hookrightarrow Z$ hold, then $X \hookrightarrow Z$ holds.

The following theorem follows by assuming that there are at least two elements
a and b in each data domain such that $SEQ(a, b) = 0$.

Theorem 4. The inference rules given in Definition 14 are sound and complete.

The Union, Decomposition, and Pseudotransitivity rules follow from these three
rules, as in the case of functional dependencies [8]. We skip the proof due to
space limitation.
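
As a brief check of the soundness direction only (a sketch of ours, not the omitted proof of Theorem 4, and saying nothing about completeness), each rule follows from Definition 13 combined with the min-based definition of SEQ on attribute sets in Equation (10). The inequalities are understood to hold for every pair of tuples $t_p$, $t_q$ in r.

```latex
% Reflexivity: Y \subseteq X, so the minimum over X ranges over a superset of Y.
SEQ(t_p[X], t_q[X]) = \min_{A \in X} SEQ(t_p[A], t_q[A])
  \le \min_{A \in Y} SEQ(t_p[A], t_q[A]) = SEQ(t_p[Y], t_q[Y])

% Augmentation: assume SEQ(t_p[X], t_q[X]) \le SEQ(t_p[Y], t_q[Y]).
SEQ(t_p[XZ], t_q[XZ]) = \min\{ SEQ(t_p[X], t_q[X]),\ SEQ(t_p[Z], t_q[Z]) \}
  \le \min\{ SEQ(t_p[Y], t_q[Y]),\ SEQ(t_p[Z], t_q[Z]) \} = SEQ(t_p[YZ], t_q[YZ])

% Transitivity: chain the two assumed inequalities.
SEQ(t_p[X], t_q[X]) \le SEQ(t_p[Y], t_q[Y]) \le SEQ(t_p[Z], t_q[Z])
```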







4.3 Validation of VFDs 



In this section, we study the validation issues of VFDs. We relax the notion
that if a VFD does not hold for a pair of tuples in r, then the VFD does not
hold. We allow the VFD to hold with a certain satisfaction degree over r. The
validation process and the calculation of the satisfaction degree of the VFD
$X \hookrightarrow A$ are given as follows:

1. For every attribute $A_i$ in $X \cup A$, we calculate $SEQ(t_p[A_i], t_q[A_i])$ between
every pair of tuples $t_p$ and $t_q$ in r by constructing two n x n (n is the
cardinality of r) upper triangular matrices X and A. The rows and columns
represent the tuples being compared. We ignore the lower part of the
matrix and the diagonal, since SEQ is commutative. Thus we get n(n-1)/2
entries in each matrix. Each entry is the comparison of a pair of tuples;

2. We check $SEQ(t_p[X], t_q[X]) \le SEQ(t_p[A], t_q[A])$ for every $t_p$, $t_q$ in r. If true,
then we say that the VFD $X \hookrightarrow A$ holds (with the satisfaction degree of 1).
We construct a matrix W = X - A to check this;

3. If the result in Step 2 is not true, in the matrix W, we count the number
of entries (denoted by s) which are less than or equal to 0. The satisfaction
degree SD of the VFD $X \hookrightarrow A$ in r can be calculated as follows:

$$SD = \frac{s}{\left(\frac{n(n-1)}{2}\right)}. \qquad (11)$$

Obviously, if the inequality given in Definition 13 holds for all tuples in r,
the satisfaction degree calculated by (11) is equal to 1.

Suppose many VFDs hold over relation r, say $f_1, f_2, \ldots, f_n$, with
the satisfaction degrees $SD_1, SD_2, \ldots, SD_n$ respectively. We use a VFD set
$F = \{f_1, f_2, \ldots, f_n\}$ to represent this. Then the satisfaction degree of the VFD
set F over relation r can be calculated by the arithmetic mean of the satisfaction
degrees of F as follows:

$$SD_F = \frac{SD_1 + SD_2 + \cdots + SD_n}{n}. \qquad (12)$$


Here is an example to illustrate the validation process and the calculation of 
the satisfaction degree of the VFD. 

Example 5. Consider the vague relation r presented in Table 1. It can be checked
that the VFD Weight $\hookrightarrow$ Price holds to a certain satisfaction degree.

In step 1, we calculate $SEQ(t_p[A_i], t_q[A_i])$ for attributes X = Weight and
A = Price, and the results are shown by matrices X and A or Tables 2 and 3.

In step 2, we check $SEQ(t_p[X], t_q[X]) \le SEQ(t_p[A], t_q[A])$ by taking the
difference between the two matrices X and A. The result is shown by matrix W or
Table 4.

Since $SEQ(t_p[X], t_q[X]) \le SEQ(t_p[A], t_q[A])$ does not hold for every p, q, we
go to step 3.

In step 3, we get s = 5. So the satisfaction degree SD can be calculated as
follows:
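
$$X = \begin{pmatrix}
- & 0 & 0 & 0.74 & 0.41 \\
- & - & 1 & 0.16 & 0.41 \\
- & - & - & 0.16 & 0.41 \\
- & - & - & - & 0.66 \\
- & - & - & - & -
\end{pmatrix}
\qquad
A = \begin{pmatrix}
- & 0.28 & 0.28 & 0.71 & 0.28 \\
- & - & 1 & 0.41 & 0.24 \\
- & - & - & 0.41 & 0.24 \\
- & - & - & - & 0.24 \\
- & - & - & - & -
\end{pmatrix}$$

Table 2. Weight

Tuples   1     2     3     4     5
1        -     0     0     0.74  0.41
2        -     -     1     0.16  0.41
3        -     -     -     0.16  0.41
4        -     -     -     -     0.66
5        -     -     -     -     -

Table 3. Price

Tuples   1     2     3     4     5
1        -     0.28  0.28  0.71  0.28
2        -     -     1     0.41  0.24
3        -     -     -     0.41  0.24
4        -     -     -     -     0.24
5        -     -     -     -     -

$$SD = \frac{s}{\left(\frac{n(n-1)}{2}\right)} = \frac{5}{\left(\frac{5(5-1)}{2}\right)} = 0.5. \qquad (13)$$
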



Therefore, the VFD Weight $\hookrightarrow$ Price over relation r holds with the satisfaction
degree 0.5.
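
The numbers in Example 5 can be reproduced with a short script. Two conventions in this sketch are our assumptions rather than statements from the paper: an element that is not listed in a vague set is given the vague value [0, 0] (as in the empty vague set of Definition 2), and the average in Equation (9) is taken over the elements listed in at least one of the two vague sets. With these conventions the entries agree with matrices X and A after rounding to two decimals. All function and variable names are hypothetical.

```python
import math
from itertools import combinations

ABSENT = (0.0, 0.0)  # assumed vague value of an element not listed in a vague set

def m_value(x, y):
    """Similarity of two vague values (alpha, 1 - beta), as in Definition 10."""
    ax, bx = x[0], 1.0 - x[1]
    ay, by = y[0], 1.0 - y[1]
    d_d = abs((ax - ay) - (bx - by)) / 2.0
    d_s = abs((ax - ay) + (bx - by))
    return math.sqrt(max(0.0, (1.0 - d_d) * (1.0 - d_s)))

def seq(X, Y):
    """Similar equality of two vague sets, averaged over the listed elements."""
    keys = sorted(set(X) | set(Y))
    return sum(m_value(X.get(u, ABSENT), Y.get(u, ABSENT)) for u in keys) / len(keys)

# Weight and Price columns of Table 1, keyed by tuple ID.
weight = {
    1: {10: (1, 1)},
    2: {20: (1, 1)},
    3: {20: (1, 1)},
    4: {10: (1, 1), 15: (0.6, 0.8)},
    5: {10: (0.6, 0.8), 15: (1, 1), 20: (0.6, 0.8)},
}
price = {
    1: {50: (0.4, 0.6), 80: (1, 1)},
    2: {100: (1, 1), 150: (0.6, 0.8)},
    3: {100: (1, 1), 150: (0.6, 0.8)},
    4: {80: (1, 1), 100: (0.6, 0.8)},
    5: {60: (0.6, 0.8), 90: (1, 1)},
}

def satisfaction_degree(lhs, rhs, tol=1e-9):
    """Steps 1 to 3: build W = X - A on the upper triangle and count entries <= 0."""
    pairs = list(combinations(sorted(lhs), 2))
    w = {(p, q): seq(lhs[p], lhs[q]) - seq(rhs[p], rhs[q]) for p, q in pairs}
    s = sum(1 for diff in w.values() if diff <= tol)
    return s / len(pairs), w

sd, w = satisfaction_degree(weight, price)
non_positive = sorted(pair for pair, diff in w.items() if diff <= 1e-9)
```

Under these assumptions the non-positive entries of W are those for the tuple pairs (1,2), (1,3), (2,3), (2,4) and (3,4), which gives s = 5 and SD = 0.5 as in Equation (13).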

Furthermore, for the zero entries in W, we check the corresponding values in
the matrix X. If the values are equal to 1, all vague sets $t_p[A_i]$ and $t_q[A_i]$ ($A_i$ in
X) are equal according to Theorem 3. Thus, we can remove some redundancies
by decomposing the original relation into two relations.

For instance, the value in position (3,2) is 0 in W above. We check the
corresponding value in position (3,2) in matrix A, and find the value is 1. So the
vague relation in Table 1 can be decomposed into two relations IW(ID, Weight) and
WP(Weight, Price) (Tables 5 and 6), and some redundancies have been removed.



5 Merge Operations of Vague Relations 

In this section, we first give the definition of merge operators of vague relations 
and then discuss the evaluation of the satisfaction degree of VFDs over the 
merged vague relations. 

Table 5. IW

ID   Weight
1    [1,1]/10
2    [1,1]/20
3    [1,1]/20
4    [1,1]/10+[0.6,0.8]/15
5    [0.6,0.8]/10+[1,1]/15+[0.6,0.8]/20

Table 6. WP

Weight                                Price
[1,1]/10                              [0.4,0.6]/50+[1,1]/80
[1,1]/20                              [1,1]/100+[0.6,0.8]/150
[1,1]/10+[0.6,0.8]/15                 [1,1]/80+[0.6,0.8]/100
[0.6,0.8]/10+[1,1]/15+[0.6,0.8]/20    [0.6,0.8]/60+[1,1]/90




5.1 Merge Operators 

Generally speaking, when multiple data sources merge together, the result may 
contain objects of three cases [10]: (1) an attribute value is not provided; (2) 
an attribute value is provided by exactly one source; (3) an attribute value is 
provided by more than one source. When merging vague data, in the first case, 
we use an empty vague set to express the unavailable value; in the second case, 
we keep the original vague set; in the third case, we take the union of the vague 
sets provided by the source. We now define two new merge operators to serve 
our purpose. 

Definition 15. (Join Merge Operator) Let $t_r$ be a tuple in the vague relation
$r$ over scheme $R = (A_1, A_2, \ldots, A_m)$ and $t_s$ be a tuple in the vague
relation $s$ over scheme $S = (A_1, A_i, \ldots, A_n)$. $r$ and $s$ have a common
ID attribute $A_1$. The attributes $A_i, \ldots, A_m$ are common to both vague
relations. Then we define the join merge of $r$ and $s$, denoted by $r \wedge s$,
as follows:

$$r \wedge s = \{\, t \mid \exists\, t_r \in r,\ t_s \in s \text{ with } t[A_1] = t_r[A_1] = t_s[A_1];\ t[A_j] = t_r[A_j],\ j = 2, \ldots, i-1;\ t[A_j] = t_r[A_j] \cup t_s[A_j],\ j = i, \ldots, m;\ t[A_j] = t_s[A_j],\ j = m+1, \ldots, n \,\},$$

where $t_r[A_j] \cup t_s[A_j]$ means the union of two vague sets as defined in
Definition 6.

Definition 16. (Union Merge Operator) Let $r' = r - \pi_R(r \wedge s)$ and
$s' = s - \pi_S(r \wedge s)$. Then we define the union merge of $r$ and $s$,
denoted by $r \vee s$, as follows:

$$r \vee s = (r \wedge s) \cup \{\, t \mid \forall\, t_{r'} \in r' \text{ with } t[A_j] = t_{r'}[A_j],\ j = 1, \ldots, m;\ t[A_j] = \emptyset,\ j = m+1, \ldots, n \,\} \cup \{\, t \mid \forall\, t_{s'} \in s' \text{ with } t[A_1] = t_{s'}[A_1];\ t[A_j] = \emptyset,\ j = 2, \ldots, i-1;\ t[A_j] = t_{s'}[A_j],\ j = i, \ldots, n \,\},$$

where $\emptyset$ means an empty vague set.

Since vague sets have the property of associativity given in [5], the join merge
operator and the union merge operator also have the property of associativity.
That is to say, $r \wedge (s \wedge t) = (r \wedge s) \wedge t$ and
$r \vee (s \vee t) = (r \vee s) \vee t$ (recall that r, s, t are vague
relations). We can also generalize Definitions 15 and 16 to more than two data
sources. Definition 16 guarantees that every tuple is contained in the new merged
relation. For example, consider the vague relations r and s given in Tables 7
and 8. We then have $r \wedge s$ and $r \vee s$ as given in Tables 9 and 10.
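
The two merge operators can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary encoding of vague sets and the names `vs_union`, `join_merge` and `union_merge` are assumptions made for illustration, and the union of two vague sets is assumed to take the element-wise maximum of both interval bounds, which is consistent with the merged values of tuple 3 in the example relations.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]      # membership interval of one element
VagueSet = Dict[str, Interval]      # element -> interval; {} is the empty vague set
VTuple = Dict[str, VagueSet]        # attribute name -> vague set
Relation = Dict[int, VTuple]        # value of the ID attribute A1 -> other attributes


def vs_union(a: VagueSet, b: VagueSet) -> VagueSet:
    """Union of two vague sets (assumed: element-wise maximum of both bounds)."""
    out = dict(a)
    for elem, (lo, hi) in b.items():
        if elem in out:
            out[elem] = (max(out[elem][0], lo), max(out[elem][1], hi))
        else:
            out[elem] = (lo, hi)
    return out


def join_merge(r: Relation, s: Relation) -> Relation:
    """Keep only IDs present in both relations and union their vague sets."""
    merged: Relation = {}
    for key in r.keys() & s.keys():
        attrs = r[key].keys() | s[key].keys()
        merged[key] = {a: vs_union(r[key].get(a, {}), s[key].get(a, {})) for a in attrs}
    return merged


def union_merge(r: Relation, s: Relation) -> Relation:
    """Join merge plus every unmatched tuple, padded with empty vague sets."""
    merged = join_merge(r, s)
    all_attrs = {a for rel in (r, s) for tup in rel.values() for a in tup}
    for rel in (r, s):
        for key, tup in rel.items():
            if key not in merged:
                merged[key] = {a: dict(tup.get(a, {})) for a in all_attrs}
    return merged


# The example relations r (attributes A2, A3) and s (attributes A3, A4).
r = {
    1: {"A2": {"2": (1, 1)}, "A3": {}},
    2: {"A2": {}, "A3": {"a": (0.3, 0.7), "c": (0.6, 0.8)}},
    3: {"A2": {"6": (0.2, 0.3), "8": (0.5, 0.7)}, "A3": {"b": (0.7, 0.9), "d": (0.5, 0.9)}},
}
s = {
    1: {"A3": {"a": (0.1, 0.4)}, "A4": {"x": (1, 1), "z": (0.6, 0.8)}},
    3: {"A3": {"a": (0.2, 0.8), "d": (0.6, 0.8)}, "A4": {}},
    5: {"A3": {"b": (0.2, 0.3), "f": (0.5, 0.7)}, "A4": {"s": (0.7, 0.9), "t": (0.5, 0.6)}},
}

joined = join_merge(r, s)       # tuples 1 and 3
unioned = union_merge(r, s)     # tuples 1, 2, 3 and 5
```

Under these assumptions the join merge keeps only tuples 1 and 3, while the union merge additionally carries tuples 2 and 5 with empty vague sets for the attributes their source does not provide.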

5.2 Satisfaction Degree of Merged Relations 

Suppose we have m data sources represented by the vague relations
$r_1, \ldots, r_m$. Each relation $r_i$ ($1 \le i \le m$) has a set of VFDs,
$F_i$, with the satisfaction degree $SD_{F_i}$ defined in (12). By the union
merge operator, we get a new relation $r = r_1 \vee \cdots \vee r_m$. We can also
get a new VFD set $F = F_1 \cup F_2 \cup \cdots \cup F_m$ over r. For each VFD in
F, we can calculate the new satisfaction degree over r by the validation process
proposed in Sect. 4. Then the satisfaction degree $SD_F$ of the new VFD set F
over relation r can be calculated by (12).

Table 7. Vague Relation r

A1   A2                        A3
1    [1,1]/2                   ∅
2    ∅                         [0.3,0.7]/a+[0.6,0.8]/c
3    [0.2,0.3]/6+[0.5,0.7]/8   [0.7,0.9]/b+[0.5,0.9]/d

Table 8. Vague Relation s

A1   A3                        A4
1    [0.1,0.4]/a               [1,1]/x+[0.6,0.8]/z
3    [0.2,0.8]/a+[0.6,0.8]/d   ∅
5    [0.2,0.3]/b+[0.5,0.7]/f   [0.7,0.9]/s+[0.5,0.6]/t

Table 9. Vague Relation r ∧ s

A1   A2                        A3                                    A4
1    [1,1]/2                   [0.1,0.4]/a                           [1,1]/x+[0.6,0.8]/z
3    [0.2,0.3]/6+[0.5,0.7]/8   [0.2,0.8]/a+[0.7,0.9]/b+[0.6,0.9]/d   ∅

In the case of non-overlapping sources, we can simplify the calculation as
follows. Assume two data sources represented by the vague relations $r_1$ and
$r_2$, which have the same VFD $X \to A$ over a common schema. Let their
satisfaction degrees be $SD_1$ and $SD_2$, and let the cardinalities of $r_1$ and
$r_2$ be $c_1$ and $c_2$. (As the sources are non-overlapping, there exists no
tuple which has the same value of $A_1$ (the ID attribute) in both $r_1$ and
$r_2$.) This implies that the cardinality of $r_1 \vee r_2$ is $(c_1 + c_2)$. In
order to calculate the new SD of $X \to A$ over $r_1 \vee r_2$, we need to
construct two new $(c_1 \times c_2)$ matrices, $X'$ and $A'$, to calculate the
SEQ of every pair of tuples between $r_1$ and $r_2$. Then we need to construct a
matrix $W' = X' - A'$ and count the number of entries (denoted by $s'$) which are
less than or equal to 0 in $W'$. According to (11), the satisfaction degree SD of
the VFD $X \to A$ over $r_1 \vee r_2$, where $C = (c_1 + c_2)(c_1 + c_2 - 1)$,
can be calculated as follows:



$$SD = \frac{c_1(c_1 - 1)}{C}\, SD_1 + \frac{c_2(c_2 - 1)}{C}\, SD_2 + \frac{2s'}{C} \qquad (14)$$
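
The calculation for non-overlapping sources can be sketched numerically. The sketch below rests on an assumption that cannot be checked against definition (11) in this section: that the satisfaction degree of a VFD is the fraction of ordered pairs of distinct tuples that satisfy it, so that each satisfying cross pair between the two sources is counted twice. The function name `merged_sd` is hypothetical.

```python
def merged_sd(sd1: float, c1: int, sd2: float, c2: int, s_cross: int) -> float:
    """Satisfaction degree of one VFD over the union merge of two
    non-overlapping relations.

    Assumption: SD is the fraction of ordered pairs of distinct tuples that
    satisfy the VFD; s_cross is the number of satisfying cross pairs (s').
    """
    n = c1 + c2
    total_pairs = n * (n - 1)                     # C in the text
    satisfied = (sd1 * c1 * (c1 - 1)              # pairs inside the first source
                 + sd2 * c2 * (c2 - 1)            # pairs inside the second source
                 + 2 * s_cross)                   # cross pairs, both orders
    return satisfied / total_pairs


# Two sources with 3 and 2 tuples; no cross pair satisfies the VFD.
example = merged_sd(0.5, 3, 1.0, 2, 0)
```

The point of the simplification is that only the cross pairs need to be evaluated; the pairs inside each source are already summarized by the known satisfaction degrees of the two sources.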



Table 10. Vague Relation r ∨ s

A1   A2                        A3                                    A4
1    [1,1]/2                   [0.1,0.4]/a                           [1,1]/x+[0.6,0.8]/z
2    ∅                         [0.3,0.7]/a+[0.6,0.8]/c               ∅
3    [0.2,0.3]/6+[0.5,0.7]/8   [0.2,0.8]/a+[0.7,0.9]/b+[0.6,0.9]/d   ∅
5    ∅                         [0.2,0.3]/b+[0.5,0.7]/f               [0.7,0.9]/s+[0.5,0.6]/t

6 Conclusions

In this paper, we incorporate the notion of vagueness into the relational data
model, with the objective of providing a generalized approach for treating
imprecise data. We propose a new similarity measure between vague sets, which
gives a more reasonable estimation than those proposed in the literature. We
apply Similar Equality (SEQ) in vague relations. The equality measure can be used
to compare elements of a given vague data domain. Based on the concept of similar
equality of attribute values in vague relations, we develop the notion of Vague
Functional Dependencies (VFDs), which is a simple and natural generalization of
classical or fuzzy functional dependencies. In spite of this generalization, the
inference rules for VFDs share the simplicity of Armstrong’s axioms for classical
FDs. We also present the validation process of VFDs and the formula to determine
the satisfaction degree of VFDs. Finally, we give the definition of merge
operators of vague relations and discuss the satisfaction degree of VFDs over the
merged vague data. As future work, we plan to extend the merge operations over
vague data, which provide a flexible means to merge data in modern applications,
such as querying internet sources and merging the returned results. We are also
studying the notion of Vague Inclusion Dependencies, which is useful for
generalizing foreign keys in vague relations.

Abstract. How to deal with the heterogeneous structures of XML documents,
identify XML data instances, solve conflicts, and effectively merge XML
documents to obtain complete information is a challenge. In this paper, we
define a merging operation over XML documents that can merge two XML documents
with different structures. It is similar to a full outer join in relational
algebra. We design an algorithm for this operation. In addition, we propose a
method for merging XML elements and handling typical conflicts. Finally, we
present a merge template XML file that can support recursive processing and
merging of XML elements.



1 Introduction 

Information about real world objects may spread over heterogeneous XML doc- 
uments. Moreover, it is critical to identify XML data instances representing the 
same real world object when merging XML documents, but each XML document 
may have different elements and/or attributes to identify objects. Furthermore, 
conflicts may emerge when merging these XML documents. 

In this paper, we present a new approach to merging XML documents. Our 
main contributions are as follows. First, we define a merging operation over 
XML documents that is similar to a full outer join in relational algebra. It can 
merge two XML documents with different structures. We design an algorithm 
for this operation. Second, we propose a method for merging XML elements and 
handling typical conflicts. Finally, we present a merge template XML file that 
can support recursive processing and merging of XML elements. 

The rest of the paper is organized as follows. Section 2 defines the merging 
operation and presents the algorithm for this operation. Section 3 studies the 
mechanism for identifying XML instances. Section 4 examines XML documents 
that this algorithm produces. Section 5 demonstrates the method for merging 
elements and handling conflicts. Section 6 describes the merge template XML 
file. Section 7 discusses related work. Finally, Section 8 concludes this paper. 

2 Our Approach 

The merging operation to be defined can merge two XML documents that have 
different structures, and create one single XML document. We assume that two 

<Factory Name="Red Cap">
  <Department DName="Production">
    <Employees>
      <Employee>
        <Name>Paul Smith</Name>
        <Age>35</Age>
        <Contact>
          <Phone>5555</Phone><Phone>1111</Phone>
          <Address><Number>78</Number>
            <Street>Main Street</Street></Address>
          <Email>paul@redcap.ca</Email>
        </Contact>
      </Employee>
    </Employees>
  </Department>
  <Department DName="Sales">
    <Employees>
      <Employee>
        <Name>Paul Smith</Name>
        <Age>40</Age>
        <Contact>
          <Phone>8888</Phone>
          <Address><Number>10</Number>
            <Street>Prince Road</Street></Address>
          <Email>paul2@redcap.ca</Email>
        </Contact>
      </Employee>
    </Employees>
  </Department>
</Factory>

Fig. 1. F1: the first XML document to be merged.



XML documents to be merged share many tag names and also have some tags 
with different tag names. We also assume that two tags that share the same tag 
name in these two XML documents describe the same kind of objects in the real 
world but their corresponding elements may have different structures. 

This merging operation can be formally represented as: 

    F3 := merging(F1, F2) on (D1, D2, P1, P2, E1, E2)

where F1 and F2 are two input XML documents to be merged and F3 is the
merged XML document; D1 and D2 are the DTDs of F1 and F2; P1 and P2 are
absolute location paths (paths for short) in XPath that designate the elements
to be merged in F1 and F2 respectively; E1 and E2 are Boolean expressions that
are used to control merging of XML elements in F1 and F2.

Boolean expression E1 is used to identify XML instances when merging F1
and F2. Also, it is used for merging of XML elements and handling conflicts. It
consists of a number of conditional expressions connected by Boolean operator
∧. Let e1 be one of the elements whose path is P1 in F1 and e2 one of the
elements whose path is P2 in F2. E1 determines if e1 in F1 and e2 in F2 describe
the same object. As long as E1 is true, e1 in F1 and e2 in F2 describe the same
object and they are merged. We say that e1 in F1 and e2 in F2 are matching
elements if they describe the same object. Boolean expression E2 is used to
determine if e2 in F2 that does not have a matching e1 in F1 will be
incorporated into F3. It consists of several conditional expressions connected
by Boolean operator ∧.

<FactoryInfo>
  <Introduction>Found in 2000</Introduction>
  <People>
    <Person PName="Paul Smith">
      <WorkIn><Factory>Red Cap</Factory>
        <Unit>Production</Unit>
        <Group>One</Group></WorkIn>
      <Age>36</Age>
      <Position>Engineer</Position>
      <Phone>1111</Phone>
      <Address><Number>78</Number>
        <Street>Main Street</Street>
        <PostCode>K5S 8E2</PostCode></Address>
    </Person>
    <Person PName="Alice Bush">
      <WorkIn><Factory>Red Cap</Factory>
        <Unit>Production</Unit>
        <Group>Two</Group></WorkIn>
      <Age>45</Age>
      <Position>Technician</Position>
      <Phone>7777</Phone>
      <Address><Number>10</Number>
        <Street>Kingsway</Street>
        <PostCode>L5S 8E2</PostCode></Address>
    </Person>
  </People>
</FactoryInfo>



Fig. 2. F2: the second XML document to be merged. 



Example 1. The two input XML documents F1 and F2 in Figures 1 and 2 have
different structures. They describe employees by different elements: Employee
elements in F1 and Person elements in F2. F1 and F2 are merged into F3 shown
in Figure 3. The merge conditions are as follows:

    P1 = /Factory/Department/Employees/Employee.
    P2 = /FactoryInfo/People/Person.
    E1 = (::Department/@DName = WorkIn/Unit) ∧ (Name = @PName).
    E2 = (::Department/@DName = WorkIn/Unit).

According to the above P1 and P2, Employee elements in F1 and Person
elements in F2 are merged into the result XML document F3. Thus, for this
example, e1 is any Employee element in F1 and e2 is any Person element in F2.

In the above E1, ::Department/@DName represents the attribute DName of
the ancestor Department of Employee element in F1 and WorkIn/Unit denotes
the child Unit of the child WorkIn of Person element in F2 (:: and @ denote an
ancestor and an attribute respectively). According to E1, an Employee element in
F1 and a Person element in F2 describe the same employee and are merged into
an Employee element in F3 if the value of the attribute DName of the ancestor
Department of an Employee is the same as the content of the descendant Unit
of a Person, and the content of the child Name of an Employee is the same as
the value of the attribute PName of a Person. Note that the child Name of an
Employee cannot identify an Employee in F1 because two Department elements
may have Employee descendants that have the same content for the child Name.
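
The evaluation of the two Boolean expressions for Example 1 can be sketched as follows. This is only an illustration using Python's standard `xml.etree.ElementTree`; the documents are abbreviated versions of the two inputs, the conditions are hard-coded rather than parsed from the expressions, and the helper names `e1_holds` and `e2_holds` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated versions of the two input documents of Example 1.
DOC1 = """<Factory Name="Red Cap">
  <Department DName="Production"><Employees>
    <Employee><Name>Paul Smith</Name><Age>35</Age></Employee>
  </Employees></Department>
  <Department DName="Sales"><Employees>
    <Employee><Name>Paul Smith</Name><Age>40</Age></Employee>
  </Employees></Department>
</Factory>"""

DOC2 = """<FactoryInfo><People>
  <Person PName="Paul Smith"><WorkIn><Unit>Production</Unit></WorkIn><Age>36</Age></Person>
  <Person PName="Alice Bush"><WorkIn><Unit>Production</Unit></WorkIn><Age>45</Age></Person>
</People></FactoryInfo>"""

f1, f2 = ET.fromstring(DOC1), ET.fromstring(DOC2)

# Elements on path P1, each paired with the DName of its ancestor Department.
employees = [(dept.get("DName"), emp)
             for dept in f1.findall("Department")
             for emp in dept.findall("Employees/Employee")]
# Elements on path P2.
persons = f2.findall("People/Person")


def e1_holds(dname, emp, person):
    """(::Department/@DName = WorkIn/Unit) and (Name = @PName)."""
    return (dname == person.findtext("WorkIn/Unit")
            and emp.findtext("Name") == person.get("PName"))


def e2_holds(person):
    """Some Department in the first document has DName equal to WorkIn/Unit."""
    unit = person.findtext("WorkIn/Unit")
    return any(d.get("DName") == unit for d in f1.findall("Department"))


matches = [(dname, p.get("PName"))
           for dname, emp in employees for p in persons
           if e1_holds(dname, emp, p)]
unmatched = [p for p in persons
             if not any(e1_holds(d, e, p) for d, e in employees)]
incorporated = [p.get("PName") for p in unmatched if e2_holds(p)]
```

Here only the Employee of the Production department matches a Person, and the non-matching Person "Alice Bush" still qualifies for incorporation because a Department named "Production" exists in the first document.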

<Factory Name="Red Cap">
  <Department DName="Production">
    <Employees>
      <Employee PName="Paul Smith">
        <Age>35|36</Age>
        <Contact>
          <Phone>5555</Phone><Phone>1111</Phone>
          <Address><Number>78</Number>
            <Street>Main Street</Street>
            <PostCode>K5S 8E2</PostCode></Address>
          <Email>paul@redcap.ca</Email>
        </Contact>
        <WorkIn><Group>One</Group></WorkIn>
        <Position>Engineer</Position>
      </Employee>
      <Employee PName="Alice Bush">
        <WorkIn><Group>Two</Group></WorkIn>
        <Age>45</Age>
        <Position>Technician</Position>
        <Phone>7777</Phone>
        <Address><Number>10</Number>
          <Street>Kingsway</Street>
          <PostCode>L5S 8E2</PostCode></Address>
      </Employee>
    </Employees>
  </Department>
  <Department DName="Sales">
    <Employees>
      <Employee PName="Paul Smith">
        <Age>40</Age>
        <Contact>
          <Phone>8888</Phone>
          <Address><Number>10</Number>
            <Street>Prince Road</Street></Address>
          <Email>paul2@redcap.ca</Email>
        </Contact>
      </Employee>
    </Employees>
  </Department>
</Factory>

Fig. 3. F3: the resulting single XML document.



According to E2, if there exists a Department in F1 that has an attribute
DName whose value is the same as the content of the descendant Unit of a
non-matching Person, this non-matching Person is incorporated into F3. Otherwise,
this non-matching Person cannot be incorporated into F3 because no element
in F1 can have this Person as a descendant.

In relational algebra, a full outer join extracts the matching rows of two tables
and preserves non-matching rows from both tables. Analogously, the merging
operation defined merges XML documents F1 and F2 that have different structures
and creates an XML document F3. It merges e1 in F1 and its matching e2 in
F2 according to D1, D2, and E1. It incorporates each modified non-matching e1
in F1 and some modified non-matching e2 elements in F2 based on D1, D2, E1,
and E2. Moreover, it incorporates the elements in F1 that do not need merging.

Path α is the prefix path of path β if α is the left part of β or α is equal to
β. For example, /x is the prefix path of /x/y. It is obvious that the path of any

<Factory Name="Red Cap">
  <Department DName="Production">
    <Employees>
      <Employee PName="Paul Smith">
        <Age>35|36</Age>
        <Contact>
          <Phone>5555</Phone><Phone>1111</Phone>
          <Address><Number>78</Number>
            <Street>Main Street</Street>
            <PostCode>K5S 8E2</PostCode></Address>
          <Email>paul@redcap.ca</Email>
        </Contact>
        <WorkIn><Group>One</Group></WorkIn>
        <Position>Engineer</Position>
      </Employee>
    </Employees>
  </Department>
  <Department DName="Sales">
    <Employees>
      <Employee PName="Paul Smith">
        <Age>40</Age>
        <Contact>
          <Phone>8888</Phone>
          <Address><Number>10</Number>
            <Street>Prince Road</Street></Address>
          <Email>paul2@redcap.ca</Email>
        </Contact>
      </Employee>
    </Employees>
  </Department>
</Factory>

Fig. 4. F_LOJ: the XML document that procedure LeftOuterJoin produces for Example 1.



ancestor of an element is the prefix path of the path of this element. Path γ is
the parent path of path δ if γ is the prefix path of δ and δ contains one more
element name than γ. For Example 1, /Factory/Department/Employees is the
parent path of P1.
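
The two path relations can be written down directly. The following Python sketch is illustrative only; it treats a path as a list of element names separated by "/", and the function names `is_prefix_path` and `is_parent_path` are hypothetical.

```python
def steps(path: str) -> list:
    """Split an absolute location path such as /x/y into its element names."""
    return [name for name in path.split("/") if name]


def is_prefix_path(alpha: str, beta: str) -> bool:
    """alpha is the left part of beta, or alpha is equal to beta."""
    a, b = steps(alpha), steps(beta)
    return a == b[:len(a)]


def is_parent_path(gamma: str, delta: str) -> bool:
    """gamma is a prefix path of delta and delta has exactly one more name."""
    g, d = steps(gamma), steps(delta)
    return len(d) == len(g) + 1 and g == d[:len(g)]


P1 = "/Factory/Department/Employees/Employee"
```

Comparing whole element names rather than raw strings avoids treating /Fact as a prefix path of /Factory.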

The algorithm for the merging operation is as follows. 

Algorithm xmlmerge
Input: F1, F2, D1, D2, P1, P2, E1, and E2.
Output: F3.
    r_F1 := the root element of F1;
    call LeftOuterJoin (F1, F2, r_F1, D1, D2, P1, P2, E1, F_LOJ);
    /* F_LOJ is the XML document generated by procedure LeftOuterJoin */
    r_FLOJ := the root element of F_LOJ;
    call FullOuterJoin (F_LOJ, F1, F2, r_FLOJ, D1, D2, P1, P2, E1, E2, F3)
End of algorithm xmlmerge

Algorithm xmlmerge merges F1 and F2, and generates an XML document
F3, which contains every element merged from e1 in F1 and its matching e2 in
F2, each modified non-matching e1 in F1, and some modified non-matching e2
elements. Also, F3 incorporates the elements in F1 that do not need merging.

Algorithm xmlmerge calls two recursive procedures LeftOuterJoin and
FullOuterJoin. We explain FullOuterJoin in Section 4. LeftOuterJoin is as follows.

Procedure LeftOuterJoin (F1, F2, e_F1, D1, D2, P1, P2, E1, F_LOJ)
    if the path of e_F1 in F1 is not equal to P1
    then
        output the start tag of e_F1 to F_LOJ;
        output all the attributes of e_F1 to F_LOJ;
        for each child element c of e_F1
            if the path of c is the prefix path of P1
            then call LeftOuterJoin (F1, F2, c, D1, D2, P1, P2, E1, F_LOJ)
            else copy c to F_LOJ;
        output the end tag of e_F1 to F_LOJ
    else
        if e_F1 has a matching element e_F2 in F2
        then
            output the start tag of e_F1 to F_LOJ;
            for every attribute a1 of e_F1 call processa1 (e_F2, a1, D1, D2, E1, F_LOJ);
            for every attribute a2 of e_F2 call processa2 (e_F1, a2, D1, D2, E1, F_LOJ);
            for every child element c1 of e_F1
                call processc1 (e_F1, e_F2, c1, D1, D2, E1, F_LOJ);
            for every child element c2 of e_F2
                call processc2 (e_F1, e_F2, c2, D1, D2, E1, F_LOJ);
            output the end tag of e_F1 to F_LOJ
        else
            output the start tag of e_F1 to F_LOJ;
            output all the attributes of e_F1 to F_LOJ;
            for every child element c of e_F1
                if c has a semantically corresponding attribute a that is an attribute of e2
                then
                    output an attribute to F_LOJ whose attribute name is that of a and
                    whose value is the content of c;
            for every child element c of e_F1
                if c does not have a semantically corresponding attribute that is an
                attribute of e2
                then copy c to F_LOJ;
            output the end tag of e_F1 to F_LOJ;
End of procedure LeftOuterJoin



Procedure LeftOuterJoin merges e1 in F1 and its matching e2 in F2 and
resolves conflicts by calling procedures processa1, processa2, processc1, and
processc2, and produces an XML document F_LOJ, which contains every element
merged from e1 in F1 and its matching e2 in F2, every modified non-matching e1
in F1, and the elements in F1 that do not need merging. For Example 1, XML
document F_LOJ that LeftOuterJoin produces is presented in Figure 4.
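
The last branch of the procedure, which handles a non-matching element, can be sketched in Python as follows. This is a simplified illustration rather than the authors' code: the correspondence between child elements and attributes is passed in as a plain dictionary instead of being derived from the DTDs and E1, and the function name `rewrite_non_matching` is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def rewrite_non_matching(e1: ET.Element, child_to_attr: dict) -> ET.Element:
    """Copy a non-matching element, turning each child listed in
    child_to_attr into an attribute and copying all other children."""
    out = ET.Element(e1.tag, dict(e1.attrib))
    # First pass: children with a semantically corresponding attribute.
    for child in e1:
        if child.tag in child_to_attr:
            out.set(child_to_attr[child.tag], child.text or "")
    # Second pass: remaining children are copied unchanged.
    for child in e1:
        if child.tag not in child_to_attr:
            out.append(copy.deepcopy(child))
    return out


# The non-matching Employee of the Sales department (abbreviated).
sales = ET.fromstring("<Employee><Name>Paul Smith</Name><Age>40</Age></Employee>")
rewritten = rewrite_non_matching(sales, {"Name": "PName"})
serialized = ET.tostring(rewritten, encoding="unicode")
```

For the Sales employee the child Name becomes the attribute PName, so the element follows the same structure as the merged Employee elements.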

3 Instance Identification

A Skolem function returns a value for an object as the identifier of this object [4].
The computation of Boolean expression E1 has the same effect as a Skolem
function. For Example 1, the constructed Skolem function concatenates
the attribute DName of the ancestor Department and the child Name of an
Employee element in F1, or the descendant Unit and the attribute PName of a
Person element in F2, and returns this concatenated value for an object as the
identifier. As long as two identifiers for two objects described in F1 and F2 are
equivalent, these two objects are the same object.

4 The Generated XML Documents 

In LeftOuterJoin (F1, F2, e_F1, D1, D2, P1, P2, E1, F_LOJ), e_F1 is the currently
processed element in F1 and it always has the property: e_F1 is one of the
elements in F1 that need merging, or e_F1 does not need merging but some of the
descendants of e_F1 need merging.

Assume y is an element in F1 that needs merging, and x is an element in
F1 that does not need merging and x is not a descendant of y. We consider the
relationship between x and y in F1. There are four cases:

(1) x is an ancestor of y. (2) x is a sibling of an ancestor of y.

(3) x and y are siblings. (4) x is a descendant of a sibling of y.

For these four cases, y is merged with its matching element in F2 and x is
incorporated into F_LOJ. When y does not have a matching element in F2, y is
modified and incorporated into F_LOJ. We consider Example 1. The Employee in
F_LOJ that has “Paul Smith” as the attribute PName and “Production” as the
attribute DName of the ancestor Department is merged from the Employee in
F1 that has “Paul Smith” as the child Name and “Production” as the attribute
DName of the ancestor Department and the matching Person in F2 that has
“Paul Smith” as the attribute PName and “Production” as the descendant Unit.
The Employee in F1 that has “Paul Smith” as the child Name and “Sales” as the
attribute DName of the ancestor Department is a non-matching Employee. It
is modified and incorporated into F_LOJ. Its child Name is changed to attribute
PName to obey the structure of the merged Employee in F_LOJ. Department and
Employees do not need merging and they are incorporated into F_LOJ.

FullOuterJoin incorporates every element in F_LOJ into XML document F3,
and modifies some non-matching e2 elements and inserts the modified
non-matching e2 elements into F3 as child elements of some elements whose path
is the parent path of P1. FullOuterJoin modifies some non-matching e2 elements
in order to resolve conflicts and make the non-matching e2 elements obey the
structure of the merged element in F_LOJ. Let us examine Example 1. The Person
in F2 that has “Alice Bush” as the attribute PName and “Production” as the
descendant Unit is a non-matching Person. This non-matching Person in F2 and
the Employees in F_LOJ that has the ancestor Department that has the attribute
DName with value “Production” make Boolean expression E2 true. Therefore,
this non-matching Person is modified and embodied into F3 as a child element
of this Employees. The element name of this non-matching Person is changed
to Employee. The child WorkIn of this non-matching Person is modified.



5 Merging XML Elements and Handling Typical Conflicts

First, we rephrase the assumptions about F1 and F2.

(1) F1 and F2 share many tag names and also have some tags with different tag
names.

(2) Two tags that share the same tag name in F1 and F2 describe the same kind
of objects in the real world, but the corresponding elements can have the
same structure or have different structures.

(3) For tags with different tag names in F1 and F2, some of them can still
describe the same kinds of objects. In this case, Boolean expression E1 indicates
that they describe the same kinds of objects.

(4) For two tags in F1 and F2 that describe the same kind of objects, the
corresponding elements have the same cardinality.

(5) For two elements whose tags describe the same kind of objects in F1 and
F2, their two attributes have the same attribute type and the same default
value if these two attributes have the same attribute name in F1 and F2.

Then, we introduce several notions. 

Elements whose tags describe the same kind of objects in F1 and F2 can be
classified into two categories: semantically identical elements and semantically
corresponding elements. Two elements in F1 and F2 are semantically identical
elements if their tags describe the same kind of objects and they have the same
structure. Two semantically identical elements can have different element names.
In this case, E1 indicates they describe the same kind of objects. Two elements
in F1 and F2 are semantically corresponding elements if their tags describe the
same kind of objects but they have different structures. Also, two semantically
corresponding elements can have different element names. In this case, E1
indicates they describe the same kind of objects. It is true that e1 in F1 and e2
in F2 are semantically corresponding elements because they actually express the
same kind of objects and they describe the same object if they make E1 true.

Two attributes in F1 and F2 are said to be semantically identical attributes
if they have the same name, and one is an attribute of an element in F1 and the
other is an attribute of the semantically identical or corresponding element of
this element in F2. Similarly, two semantically identical attributes can have
different names. In this case, they are specified as semantically identical
attributes in E1. An attribute in one XML file to be merged can have a
semantically corresponding element in another XML file to be merged. An
attribute and an element are a pair of semantically corresponding attribute and
element if the name of this attribute is the same as the name of this element,
this attribute is an attribute of element g in one XML file and this element is a
child element of the semantically corresponding element of element g in another
XML file, this attribute is a required attribute and of type CDATA, and this
element is specified as a parsed character data element with cardinality 1 and it
does not have any attribute. Also, the attribute name and element name of a pair
of semantically corresponding attribute and element can be different. In this
case, E1 indicates they are a pair of semantically corresponding attribute and
element.

We present the method for merging elements and handling conflicts. 

Conflicts may emerge when LeftOuterJoin merges e_F1 in F1 and its matching
e_F2 in F2 into an element in F_LOJ. Let a1 be an attribute of e_F1 and a2 an
attribute of e_F2. Let c1 be a child element of e_F1 and c2 a child element of
e_F2. Typical conflicts are: conflicts between a1 and a2, conflicts between a1
and c2, conflicts between c1 and a2, conflicts between c1 or a descendant of c1
and c2 or a descendant of c2, and conflicts between c2 or a descendant of c2 and
an ancestor of e_F1.

If attribute a1 of e_F1 in F1 has a semantically identical attribute that is
an attribute of e_F2 in F2, a1 and its semantically identical attribute should
be merged into an attribute. If a1 and its semantically identical attribute are
consistent with each other, redundancy is eliminated by merging them into one
attribute; otherwise, a conflict is indicated in the merged attribute. Similarly, if
attribute a1 of e_F1 in F1 has a semantically corresponding element that is a
child element of e_F2 in F2, a1 and its semantically corresponding element are
merged into an attribute. Procedure processa1 accomplishes these tasks.

In Example 1, the child element Name of Employee in F1 and the attribute
PName of Person in F2 are semantically corresponding element and attribute
because (Name = @PName) is specified in Boolean expression E1. They are
combined into the attribute PName of the merged Employee element in F_LOJ.

The relationship between a descendant of e_F1 and a descendant of e_F2 is
illustrated in Figure 5, where (e) shows the case in which no semantically
identical or corresponding element is found for an element.

Assume that d1 is a descendant of e_F1, d2 is a descendant of e_F2, and d1 and
d2 are semantically corresponding or identical elements. Based on the
assumptions about the two XML documents to be merged, d1 and d2 have the same
cardinality. If the cardinality is not greater than 1, d1 and d2 are merged into
an element and conflicts between them are reported. Otherwise, d1 and d2 cannot
be simply merged into an element. When d1 and d2 describe the same object, they
are merged into an element; otherwise, both d1 and d2 are incorporated into
the merged element in F_LOJ. Moreover, when d1 and d2 are semantically
corresponding elements that represent the same object, if d2 has some attributes
and/or descendants that d1 does not have, an element that has both the
attributes and descendants of d1 and the extra attributes and/or descendants of
d2 is incorporated into the merged element in F_LOJ as a descendant.

Fig. 5. The relationship between c1 and c2.

Recursive procedure processc1 is responsible for completing the above tasks.
We consider Example 1 again. The child Age of Employee in F1 and the child
Age of Person in F2 are semantically identical elements with cardinality 1. They
are combined into the child Age of the merged Employee in F_LOJ. F_LOJ reports
a conflict: the child Age of the merged Employee in F_LOJ has content “35|36”.
The content “35|36” is an or-value and it implies it is not clear which one is the
correct one [7]. The child Contact of Employee in F1 contains Phone, Address,
and Email child elements. The child Phone of Contact and the child Phone of
Person are semantically identical elements. The cardinality of Phone is greater
than 1, so the child Phone of the child Contact and the child Phone of Person are
usually fused into the Phone child elements of the child Contact of the merged
Employee element in F_LOJ. The child Address of the child Contact of Employee
and the child Address of Person are semantically corresponding elements with
cardinality 1. There are no conflicts between them, and the child Address of
Person has a child PostCode that the child Address of Contact does not have,
and as a result, this child PostCode is added to the child Address of the child
Contact of the merged Employee in F_LOJ. The child Email of the child Contact
of Employee is embodied in the child Contact of the merged Employee in F_LOJ.

Assume that descendant d of e_F2 has a semantically corresponding or
identical element that is an ancestor of e_F1 in F1. To deal with d, two solutions
are possible. One is to simply include d into the merged element in F_LOJ. This
results in a typical conflict: a conflict between c2 or a descendant of c2 and an
ancestor of e_F1. Another is to simply exclude d. This also has a problem: if d
contains some descendants that are not semantically corresponding or identical
elements of any ancestors of e_F1, the information about these descendants of d
is lost in the merged element in F_LOJ. It is appropriate to reconcile these two
opposing solutions by modifying d and incorporating this modified d into the
merged element in F_LOJ.

Recursive procedure processc2 carries out the tasks described above. Let
us examine Example 1. The child WorkIn of Person in F2 has Factory, Unit,
and Group child elements. The child Factory of the child WorkIn of Person in
F2 and the ancestor Factory of Employee in F1 are semantically corresponding
elements. Also, the child Unit of the child WorkIn of Person in F2 and the
ancestor Department of Employee are semantically corresponding elements because
(::Department/@DName = WorkIn/Unit) is specified in Boolean expression E1.
Consequently, WorkIn that has only child element Group is included into the
merged Employee in F_LOJ as a child element.
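
The reconciliation described above, in which d is modified rather than included or excluded as a whole, can be sketched as follows. The sketch handles only direct children of d, takes the set of child names already represented by ancestors as an explicit argument, and uses the hypothetical function name `prune_ancestor_info`; it is not the recursive procedure itself.

```python
import copy
import xml.etree.ElementTree as ET


def prune_ancestor_info(d: ET.Element, covered: set):
    """Return a copy of d without the children whose tags are in covered
    (those already represented by an ancestor); None if no child remains."""
    out = copy.deepcopy(d)
    for child in list(out):
        if child.tag in covered:
            out.remove(child)
    return out if len(out) else None


work_in = ET.fromstring(
    "<WorkIn><Factory>Red Cap</Factory><Unit>Production</Unit>"
    "<Group>One</Group></WorkIn>"
)
# Factory and Unit correspond to the ancestors Factory and Department.
kept = prune_ancestor_info(work_in, {"Factory", "Unit"})
kept_xml = ET.tostring(kept, encoding="unicode")
```

Only Group survives, which matches the WorkIn element of the merged Employee for "Paul Smith" in Figure 4.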

6 A Merge Template XML File 

In our implementation, a merge template XML file is created to express P1, P2,
E1, and E2. Figure 6 shows an example merge template XML file for Example
1 where MergeTemplate has three child elements: P1, P2, and Key. P1 and P2
indicate the paths of elements to be merged in F1 and F2 respectively. Key gives
the information for identifying XML instances and handling typical conflicts.

The order of element names in P1 or P2 is significant. The first one is the
name of the root element of the corresponding XML document and the last
one indicates the name of the elements to be merged. Moreover, each pair of
consecutive element names in a path is associated with a pair of a parent and a
child in the corresponding XML document, and the child element in each pair of
a parent and a child in F1 associated with a pair of consecutive element names
in P1 is the only kind of child that needs merging or has descendant elements
that require merging. All these characteristics are used to support recursive
processing of XML elements in F1 and merging of designated elements in F1 and F2.

<MergeTemplate>
  <P1 Path="/Factory/Department/Employees/Employee"/>
  <P2 Path="/FactoryInfo/People/Person"/>
  <Key>
    <Factor Name1="::Department/@DName" Name2="WorkIn/Unit"
            Selected="Yes"/>
    <Factor Name1="Name" Name2="@PName" Function="samename"/>
  </Key>
</MergeTemplate>

Fig. 6. An example merge template XML file.

Each child Factor of Key describes a conditional expression in E1, and a
Factor that has a Selected attribute with value "Yes" also describes a conditional
expression in P2. In Sections 2 and 3, we assume that XML data in P1 and XML
data in P2 specified in each conditional expression in P1 have the same
representations. In fact, they may have different formats. We define Boolean
functions to solve this problem. Consequently, the mechanism presented combines
Skolem functions and user-defined Boolean functions to identify XML instances.
The Boolean function samename(n1, n2) specified in Figure 6 returns true if n1
and n2 actually refer to the same name although they have different formats.
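
As an illustration only, the following Python sketch reads a merge template shaped like the one in Figure 6 and attaches a Boolean matching function to each Factor. The function `load_template`, the `FUNCTIONS` registry, and the particular normalization inside `samename` ("Last, First" versus "First Last") are assumptions made for this example; the paper does not specify how its prototype implements them.

```python
import xml.etree.ElementTree as ET

# Template text modelled on Figure 6.
TEMPLATE = """
<MergeTemplate>
  <P1 Path="/Factory/Department/Employees/Employee"/>
  <P2 Path="/FactoryInfo/People/Person"/>
  <Key>
    <Factor Name1="::Department/@DName" Name2="WorkIn/Unit" Selected="Yes"/>
    <Factor Name1="Name" Name2="@PName" Function="samename"/>
  </Key>
</MergeTemplate>
"""


def samename(n1, n2):
    """Hypothetical matcher: 'Last, First' and 'First Last' count as equal."""
    def norm(name):
        parts = [p.strip().lower() for p in name.split(",")]
        if len(parts) == 2:          # "Last, First" -> "First Last"
            parts = [parts[1], parts[0]]
        return " ".join(" ".join(parts).split())
    return norm(n1) == norm(n2)


# Registry of user-defined Boolean functions, keyed by the Function attribute.
FUNCTIONS = {"samename": samename}


def load_template(text):
    """Return (path1, path2, factors) extracted from a merge template."""
    root = ET.fromstring(text.strip())
    p1 = root.find("P1").get("Path").strip("/").split("/")
    p2 = root.find("P2").get("Path").strip("/").split("/")
    factors = []
    for factor in root.find("Key").findall("Factor"):
        factors.append({
            "name1": factor.get("Name1"),
            "name2": factor.get("Name2"),
            "selected": factor.get("Selected") == "Yes",
            # Without a Function attribute, fall back to plain equality.
            "match": FUNCTIONS.get(factor.get("Function"), lambda a, b: a == b),
        })
    return p1, p2, factors
```

The first name of each path is the document root and the last is the element to be merged, which is why the paths are kept as ordered lists.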

7 Related Work 

Bertino et al. point out that XML data integration involves reconciliation at
the data model level, data schema level, and data instance level [2]. In this
paper, we mainly focus on reconciliation at the data instance level to merge XML
documents that have different structures.

Much research on semantic integration of XML data has been conducted [3,
10]. Castano et al. propose a semantic approach to the integration of heterogeneous
XML data by building a domain ontology [3]. Rodríguez-Gianolli et al. present a
framework that provides a tool to integrate DTDs into a common conceptual
schema [10]. Several systems for processing XML or XML streams have been
developed [8, 9]. The Niagara system focuses on providing query capabilities for
XML documents and can handle infinite streams [9]. Lore is a semi-structured data
repository that builds a database system to query XML data [8]. The merging
operation defined in this paper is not available in any of these works or systems.

Much research on merging or integrating XML data with similar or
identical structures has also been done [6, 7]. A data model for semi-structured
data is introduced and an integration operator is defined in [7]. This operator
integrates similarly structured XML data. Lindholm designs a 3-way merging
algorithm for XML files that comply with an identical DTD [6]. The mechanism
proposed in this paper can merge two XML documents that have different structures.

Merge Templates that specify how to recursively combine two XML docu- 
ments are introduced by Tufte et al. [12]. Our work is different from their work 
in several aspects. First, the Merge operation proposed by Tufte et al. combines 
two similarly structured XML documents to create aggregates over streams of 
XML fragments. Second, a method for merging XML elements and handling 
typical conflicts is proposed in this paper. 

When merging XML documents, it is critical to identify XML data instances
representing the same real-world object. Albert uses the term instance
identification to refer to this problem [1]. This problem has been investigated [1,
5], and different methods have been proposed to deal with it. A universal
key is used in [1]. Lim et al. define the union of keys of the data sources [5].
However, these works deal with databases and support typed data. The Skolem
function is introduced in [4]. It returns a value for an object as the identifier
of this object. Saccol et al. present a proposal for instance identification based
on Skolem functions [11]. The mechanism presented in this paper combines Skolem
functions and Boolean functions defined for designers [11] to identify XML instances.

8 Conclusion 

We have defined a merging operation over XML documents that is similar to a 
full outer join in relational algebra. It can merge two XML documents with dif- 
ferent structures. We have implemented a prototype to merge XML documents. 

We plan to investigate other operations over XML documents, such as
intersection and difference.

Abstract. An effective solution to automate information integration
is represented by wrappers, i.e. programs designed for extracting
relevant contents from a particular information source, such as
web pages. Wrappers allow such contents to be delivered through a
self-describing and easily processable representation model. However, most
existing approaches to wrapper design focus mainly on how to generate
extraction rules, and do not weigh the importance of specifying
and exploiting the desired schema of the extracted information. In this
paper, we propose a new wrapping approach which encompasses both
extraction rules and the schema of the required information in wrapper
definitions. We investigate the advantages of suitably exploiting extraction
schemata, and we define a clean declarative wrapper semantics by
introducing (preferred) extraction models for source HTML documents with
respect to a given wrapper.



1 Introduction 

Information available on the Web is mainly encoded in the HTML format.
Typically, HTML pages follow source-native and fairly structured styles, and thus
are ill-suited for automatic processing. However, the need for extracting and
integrating information from different sources into a structured format has become
a primary requirement for many information technology companies. For example,
one would like to monitor appealing offers about books concerning specific
topics. Here, an interesting offer may consist in finding highly-rated books.

In this context, an effective solution to automate information integration is
the exploitation of wrappers. Essentially, wrappers are programs designed
for extracting relevant contents from a particular information source (e.g.
HTML pages), and for delivering such contents through a self-describing and
easily processable representation model. XML [19] is widely known as the standard
for representing and exchanging data through the Web, and therefore successfully
fulfills the above requirements for a wrapping environment.

Generally, a wrapper consists of a set of extraction rules which are used
both to recognize relevant content portions within a document and to map them
to specific semantics. Several wrapping technologies have been recently
developed: we mention here TSIMMIS [8], FLORID [15], DEByE [13], W4F [18],
XWrap [14], RoadRunner [2], and Lixto [1] as exemplary systems proposed by
the research community. Traditional issues concerning wrapper systems are the
development of powerful languages for expressing extraction rules and the
capability of generating these rules with the lowest human effort. Such issues can
be addressed by a number of approaches, such as wrapper induction based on
learning from annotated examples [6, 9, 12, 17] and the visual specification of
wrappers [1]. The first approach suffers from negative theoretical results on the
expressive power of learnable extraction rules, while visual wrapper generation
allows the definition of more expressive rules [7]. However, although the schema
of the required information should be carefully defined at the time of wrapper
generation, most existing wrapper design approaches focus mainly on how
to specify extraction rules. Indeed, while generating wrappers, such approaches
ignore the potential advantages coming from the specification and usage of the
extraction schema, that is, the desired schema of the documents to be created to
contain the extracted information. A specific extraction schema can help to
recognize and discard irrelevant or noisy information from documents resulting from
the data extraction, thus improving the accuracy of a wrapper. Furthermore,
the extracted information can be straightforwardly used in the data integration
process, since it follows a specific organization that best reflects user requirements.

As a running example, consider the excerpt of an Amazon page displayed in Fig. 1,
and suppose we would like to extract the title, the author(s), the customer rate
(if available), the price proposed by the Amazon site, and the publication year,
for any book listed in the page. The extraction schema for the above information
can be suitably represented by the following DTD:

<!ELEMENT doc (store)>
<!ELEMENT store (book+)>
<!ELEMENT book (title, author+, (customer_rate | no_rate), price, year)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (name)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT customer_rate (rate)>
<!ELEMENT no_rate EMPTY>
<!ELEMENT rate (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT year (#PCDATA)>

It is easy to see that such a schema allows the extraction of structured
information with multi-value attributes (operator +), missing attributes (operator
?), and variant attribute permutations (operator |).
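
To make the role of the content model concrete, here is a minimal Python sketch that checks whether a sequence of child element names conforms to the declaration of book given above. Encoding the content model as a Python regular expression over space-terminated names, and the helper name `conforms`, are assumptions of this example rather than part of the proposed framework.

```python
import re

# Content model of `book`: title, author+, (customer_rate | no_rate), price, year.
# Each element name is followed by one space so that names cannot run together.
BOOK_MODEL = re.compile(r"title (author )+(customer_rate|no_rate) price year ")


def conforms(children):
    """True if the child-name sequence matches the book content model."""
    word = "".join(name + " " for name in children)
    return BOOK_MODEL.fullmatch(word) is not None
```

A sequence containing both customer_rate and no_rate, or no author at all, is rejected, which is the kind of constraint that cardinality bounds alone cannot express.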

As mentioned above, existing wrappers are not able to specify and exploit 
extraction schemata. Some full-fledged systems describe a hierarchical structure 
of the information to be extracted [1, 17], and they are mostly capable of speci- 
fying constraints on the cardinality of the extracted sub-elements. However, no 
such system allows complex constraints to be expressed: for instance, it is not 
possible to require that element customer_rate may occur alternatively to el- 
ement no_rate. As a consequence, validating the extraction of elements with 
complex contents is not allowed. 

Two preliminary attempts at exploiting information on the extraction schema
have been recently proposed in the information extraction [10] and wrapping [16]
research areas. In the former work, schemata represented as tree-like structures
do not allow alternative subexpressions to be expressed. Moreover, a heuristic
approach is used to make a rule fit to other mapping rule instances: as a
consequence, rule refinement based on user feedback is needed. In [10], DTD-style
extraction rules exploiting enhanced content models are used in both learning
and extracting phases.

Fig. 1. Excerpt of a sample Amazon page from www.amazon.com

[3] is related to a particular direction of research: turning the schema match- 
ing problem into an extraction problem based on inferring the semantic cor- 
respondence between a source HTML table and a target HTML schema. The 
proposed approach differs from the previous ones related to schema mapping 
since it entails elements of table understanding and extraction ontologies. In 
particular, table understanding strategies are exploited to form attribute-value 
pairs, then an extraction ontology performs data extraction. 

It is worth noticing that all the above approaches lack a rigorous formalism 
for the specification of extraction rules. Moreover, they do not define any model 
for the construction of the documents into which the extracted information has 
to be inserted. 

Our contributions can be summarized as follows. We propose a novel
wrapping approach which improves standard approaches based on hierarchical
extraction by introducing an extraction schema into wrapper generation.
Indeed, a wrapper is defined by specifying, besides a set of extraction rules, the
desired schema of the XML documents to be built from the extracted
information. The availability of the schema not only allows the extracted XML
documents to be effectively used for further processing, but also allows the
exploitation of simpler rules for extracting the desired information. For instance,
to extract customer_rate from a book, a standard approach should express a rule
extracting the third row of a book table only if this row contains an image
displaying the "rate". The presence of the extraction schema allows the definition
of two simple rules, one for the customer_rate element and one for its rate
subelement: the former extracts the third row of the book table, while the latter
extracts an image. Moreover, our approach in principle does not rely on any
particular form of extraction rules, that is, any preexisting kind of rule can be
easily plugged in; however, we show that XPath extraction rules are particularly
suitable for our purposes. Finally, we define a clean declarative semantics of
schema-based wrappers: this is accomplished by introducing the concept of
extraction models for source documents with respect to a given wrapper, and by
identifying a unique preferred model.

2 Preliminaries 

Any XML document can be associated with a document type definition (DTD)
that defines the structure of the document and what tags may be used to
encode the document. A DTD is a tuple D = (El, P, e_r) where: i) El is a finite
set of element names, ii) P is a mapping from El to element type definitions,
and iii) e_r ∈ El is the root element name. An element type definition is a
one-unambiguous regular expression α defined as follows¹:

— α → α_1 || α_2,

— α_1 → (α_1) || α_1 | α_1 || α_1, α_1 || α_1? || α_1* || e,

— α_2 → ANY || EMPTY || #PCDATA,

where e ∈ El, #PCDATA is an element whose content is composed of character
data, EMPTY is an element without content, and ANY is an element with generic
content. An element type definition specifies an element-content model that
constrains the allowed types of the child elements and the order in which they are
allowed to appear. A recursive DTD is a DTD with at least one recursive element
type definition, i.e. an element whose definition refers to itself or to an element
that can be its ancestor. In other terms, a recursive DTD admits documents
such that an element e may contain (directly or indirectly) an element of the
same type. For the sake of clarity of presentation, we refer to DTDs which do not
contain attribute lists. As a consequence, we consider a simplified version of
XML documents, whose elements have no attributes.

¹ The symbol || denotes different productions with the same left part. Here we do not
consider mixed content of elements [19].

In our domain, the application of a wrapper to a source document can
produce several candidate document results. A desirable property of a wrapping
framework is that of producing results that are ordered with respect to
some criteria in order to identify a unique preferred extraction document.

We accomplish this objective by exploiting partially ordered regular
expressions [4], i.e. an extension of regular expressions where a partial order between
strings holds. A partially ordered language over a given alphabet Σ is a pair
(L, >_L), where L is a (standard) language over Σ (a subset of Σ^+) and >_L
is a partial order on the strings of L. Ordered regular expressions are defined
by adapting classical operations for standard languages to partially ordered
languages. In particular, a new set of strings and a partial order on this set can
be defined for the operations of prioritized union, concatenation, and prioritized
closure between languages [4].

Let Σ be an alphabet. The ordered regular expressions over Σ, and the sets
that they denote, are defined recursively as follows:

1. ∅ is a regular expression and denotes the empty language (∅, ∅);

2. for each a ∈ Σ, a is a regular expression and denotes the language ({a}, ∅);

3. if α_1 and α_2 are regular expressions denoting languages L(α_1) and L(α_2),
respectively, then i) α_1 + α_2 denotes the prioritized union language L(α_1) ⊕
L(α_2), ii) α_1 α_2 denotes the concatenation language L(α_1)L(α_2), iii) α_1*
denotes the prioritized closure language L(α_1)^▷.

Proposition 1. Let α be a one-unambiguous ordered regular expression. The
language L(α) is linearly ordered. □

3 Schema-Based Wrapping Framework 

In the following we describe our proposal to extend traditional hierarchical
wrappers in such a way that they can effectively benefit from exploiting extraction
schemata. To this purpose, we do not focus on a particular extraction language,
but investigate how to build documents, for the extracted information, that
are valid with respect to a predefined schema. Indeed, our approach can
profitably employ different kinds of extraction rules. Therefore, before describing the
schema-based wrapping approach in more detail, we introduce a general notion
of extraction rule.

We assume any source HTML document is represented by its parse tree,
also called an XHTML document. Generally, each extraction rule works on a
sequence of nodes of an HTML parse tree, providing a sequence of sequences of
nodes. Notice that working on a tree-based model for HTML data is not a strong
requirement, and can be easily relaxed. However, for the sake of simplicity, we
do not refer to string-based extraction rules like those introduced in [1, 11, 17].

Definition 1 (Extraction rule). Given an HTML parse tree doc and a
sequence s_p of nodes in doc, an extraction rule r is a function associating s_p with
a sequence S of node sequences. □

Extraction rules so defined can be seen as a generalization of Lixto extraction 
filters. The main difference with respect to Lixto filters is that our rules allow 
the extraction of non-contiguous portions of an HTML document. However, an 
extraction rule is not able to contain references to elements extracted by different 
rules. 

Moreover, we define a special type of extraction rule which turns out to be
particularly useful to address the problem of wrapper evaluation [5].

Definition 2 (Monotonic extraction rule). Given a sequence s_p of nodes in
an HTML parse tree doc, a monotonic extraction rule r is a function associating
s_p with a sequence S of node sequences such that, for each sequence s ∈ S and
for each node n ∈ s, there exists n' ∈ s_p which is an ancestor of n. □
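
The monotonicity condition of Definition 2 can be checked on a concrete result as in the following Python sketch. The representation of the parse tree as a child-to-parent dictionary, the reading of "ancestor" as proper ancestor, and the function names are assumptions made for illustration.

```python
def ancestors(node, parent):
    """Yield the proper ancestors of `node`, following a child -> parent map."""
    while node in parent:
        node = parent[node]
        yield node


def is_monotonic_result(s_p, S, parent):
    """True if every node of every sequence in S has an ancestor in s_p."""
    context = set(s_p)
    return all(
        any(a in context for a in ancestors(n, parent))
        for s in S
        for n in s
    )
```

For a tree in which nodes 4 and 5 are children of node 2 and node 6 is a child of node 3, the result [[4], [4, 5]] is admissible for the context [2], while [[4, 6]] is not, because node 6 has no ancestor in the context.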

Let us now introduce our notion of wrapper. A wrapper is essentially
composed of: i) the desired schema D of the information to be extracted from HTML
documents, and ii) a set R of extraction rules. As in most earlier approaches
(such as [1, 17]), the extraction of the desired information proceeds in a
hierarchical way. The formal definition of a wrapper is provided below.

Definition 3 (Wrapper). Let D = (El, P, e_r) be a DTD, R be a set of
extraction rules, and w be a function associating each pair (e_i, e_j) of elements
e_i, e_j ∈ El with a rule r ∈ R. A wrapper is defined as W_R = (R, w). □

In practice, a wrapper associates the root element e r of the DTD with the root 
of the HTML parse tree to be processed, then it recursively builds the content 
of e r by exploiting the extraction rules to identify the sequences of nodes that 
should be extracted. In other terms, once an element e has been associated with 
a sequence s of nodes of the source document, an extraction rule r is applied to 
s to identify the sequences that can be associated with the children of e. 

In order to devise a complete specification of a wrapper, we further propose 
an effective implementation of extraction rules based on the XPath language [20] . 

3.1 XPath Extraction Rules 

The primary syntactic construct in XPath is the expression. An expression is
evaluated to yield an ordered collection of nodes without duplicates, i.e. a
sequence of nodes. In this work, we consider XPath expressions with variables.
The evaluation of an XPath expression occurs with respect to a context and a
variable binding. Variable bindings represent mappings from variable names to
sequences of objects. Formally, given a variable binding θ and a variable name
$v, we denote with θ($v) the sequence associated to $v by θ. Moreover, given
two disjoint variable bindings θ_1 and θ_2, we denote with θ_1 ∘ θ_2 a variable
binding such that, for each $v, (θ_1 ∘ θ_2)($v) = θ_1($v) (resp. (θ_1 ∘ θ_2)($v) =
θ_2($v)) if θ_1($v) (resp. θ_2($v)) is defined, otherwise (θ_1 ∘ θ_2)($v) is undefined.
Given an XPath expression p, an XHTML document doc, a sequence of nodes s,
and a variable binding θ, p(s, θ, doc) denotes the sequence of nodes provided by p
when p is evaluated on doc, starting from s and according to θ.
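
The composition of disjoint variable bindings can be pictured with plain dictionaries, as in the short Python sketch below. Representing a binding as a dict from variable names to node lists, and the function name `compose`, are assumptions of this example.

```python
def compose(theta1, theta2):
    """Compose two disjoint variable bindings (dicts: variable name -> node sequence)."""
    if theta1.keys() & theta2.keys():
        raise ValueError("bindings must be disjoint")
    # A variable is looked up in whichever binding defines it.
    return {**theta1, **theta2}
```

A variable defined in neither binding stays undefined in the composition, which here corresponds to the key being absent from the resulting dictionary.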

The relation between the result of an XPath expression and a variable is 
represented by the concept of XPath predicate , which is formally defined as 
follows. 

Definition 4 (XPath predicate). Given a set {$v_1, ..., $v_n, $c, $u} of
variables and an XPath expression p using the variables $v_1, ..., $v_n, we denote an
XPath predicate with $c : p → $u. Moreover, we denote a subsequence XPath
predicate with $c : p ↠ $u.
Given an XHTML document doc and a variable binding θ, an XPath predicate
$c : p → $u is true with respect to θ if θ($u) = p(θ($c), θ, doc). Analogously, a
subsequence XPath predicate $c : p ↠ $u is true with respect to θ if θ($u) is a
subsequence of p(θ($c), θ, doc). □

Moreover, we consider an order on node sequences which is defined
according to the document order. Given two sequences s' = [n'_1, ..., n'_k] and
s'' = [n''_1, ..., n''_h], s' precedes s'' (s' ≺ s'') if there exists an index i such
that n'_j = n''_j, for each j < i, and n'_i < n''_i, or s' is a prefix of s''. Given an
XHTML document doc, a variable binding θ and a subsequence XPath
predicate $c : p ↠ $u, we denote with eval($c : p ↠ $u, θ) the sequence of node
sequences [s_1, ..., s_k] such that s_i ≺ s_j, for each i < j, and $c : p ↠ $u is true
with respect to θ ∘ {$u/s_i}, for each i.

XPath predicates are the basis of more complex concepts, such as extraction 
filters and extraction rules. An extraction filter is defined over both a target 
predicate and a set of other predicates which act as filter conditions. 

Definition 5 (XPath extraction filter). Given a set of variables {$v_1, ...,
$v_n, $u}, an XPath extraction filter is defined as a tuple f = (tp, V), where:

— tp is a target predicate, that is, a subsequence XPath predicate defining
variable $u on the empty set of variables;

— V is a conjunction of predicates defined on variables {$v_1, ..., $v_n, $u}. □

The application of an XPath extraction filter f = (tp, V) to a sequence
s = [n_1, ..., n_k] of nodes yields a sequence of node sequences f(s) = [s_1, ..., s_h]
where: 1) s_i ≺ s_j, for each i < j, 2) s_i ∈ eval(tp, {$u/s}), for each i ∈ [1..h], and
3) there exists a substitution θ, which is disjoint with respect to {$u/s, $c/s_i},
such that each XPath predicate in V is true with respect to θ ∘ {$u/s, $c/s_i}.

We define each extraction rule as a composition of two kinds of filters:
extraction filters and external filters. The latter specify conditions on the size of
the extracted sequences. In particular, we consider the following external filters:

— an absolute size condition filter as specified by bounds (min, max) on the
size of a node sequence s, that is, as(s) is true if min ≤ size(s) ≤ max;

— a relative size condition filter rs specified by policies {minimize, maximize},
that is, given a sequence S of node sequences and a sequence s ∈ S, rs(s, S)
is true if rs = minimize (resp. rs = maximize) and there does not exist a sequence
s' ∈ S, s' ≠ s, such that s' ⊂ s (resp. s' ⊃ s).

Definition 6 (XPath extraction rule). An XPath extraction rule is defined
as r = (EF, as, rs), where EF = f_1 ∨ ... ∨ f_m is a disjunction of extraction
filters, and as and rs are external filters. □

For any sequence s of nodes, the application of an XPath extraction rule r =
(EF, as, rs) to s yields a sequence of node sequences r(s), which is constructed as
follows. Firstly, we build the ordered sequence EF(s) = [s_1, ..., s_k] = ∪_{i=1..m} f_i(s),
that is, the sequence obtained by merging the sequences produced by each
extraction filter f_i ∈ EF applied to s. Secondly, we derive the sequence of node
sequences S' = [s_{i_1}, ..., s_{i_h}], h ≤ k, by removing from EF(s) all the sequences
s_i ∈ EF(s) such that as(s_i) is false. Finally, we obtain r(s) by removing from
S' all the sequences s_{i_j} ∈ S' such that rs(s_{i_j}, S') is false.

Example 1. Suppose we are given an extraction rule r = (f_1 ∨ f_2, (2, 4), minimize),
where filters f_1 and f_2 are defined respectively as:

f_1 = ($c : //a ↠ $i,
       {$i : [child::*[last()][name() = 'c']] → $v_1,
        $i : [child::*[position() = 1][name() = 'd']] → $v_2}),

f_2 = ($c : //a/b ↠ $i,
       {$i : [child::*[last()][name() = 'd']] → $v_1,
        $i : [child::*[count(*) = 2]] → $v_2}).

Consider now the document tree sketched below, and suppose we apply the rule
r to the sequence of nodes s = [1, 2, 3].

The target predicate of f_1 returns the sequence [[5], [5, 7], [5, 7, 8], [7], [7, 8], [8]],
which is turned into [[5], [5, 8], [8]] by applying the conditions in f_1. Analogously,
the target predicate of f_2 returns the sequence [[11], [11, 13], [11, 13, 14], [11, 13, 14,
16], [13], [13, 14], [13, 14, 16], [14], [14, 16], [16]], which is reduced to [[13, 14]]. The
union of f_1 and f_2 is computed as f_1 ∨ f_2 = [[5], [5, 8], [8], [13, 14]]. By
applying the external filters it can be straightforwardly derived that the resulting
sequence is [[5, 8], [13, 14]].
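
The last step of Example 1 can be replayed with a small Python sketch of the two external filters. It starts from the already computed union [[5], [5, 8], [8], [13, 14]] rather than evaluating the XPath filters, and it interprets containment between node sequences as proper set inclusion; both choices, as well as the function names, are assumptions of this illustration.

```python
def absolute_size(s, lo, hi):
    """Absolute size condition: lo <= size(s) <= hi."""
    return lo <= len(s) <= hi


def relative_size(s, S, policy):
    """Relative size condition under the 'minimize' or 'maximize' policy."""
    others = [set(t) for t in S if t != s]
    if policy == "minimize":
        # No other sequence may be properly contained in s.
        return not any(t < set(s) for t in others)
    if policy == "maximize":
        # No other sequence may properly contain s.
        return not any(t > set(s) for t in others)
    raise ValueError(policy)


def apply_external_filters(ef_s, bounds, policy):
    """Apply the absolute filter first, then the relative one on the survivors."""
    kept = [s for s in ef_s if absolute_size(s, *bounds)]
    return [s for s in kept if relative_size(s, kept, policy)]


EF_S = [[5], [5, 8], [8], [13, 14]]
RESULT = apply_external_filters(EF_S, (2, 4), "minimize")
```

The order of the two steps matters: the relative filter is evaluated on the sequences that survive the absolute filter, so [5, 8] is kept even though [5] and [8] appear in the union.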



4 Wrapper Semantics 

In this section we provide a clean declarative semantics for schema-based wrap- 
pers. This is accomplished by introducing the notion of extraction models for 
source HTML documents with respect to a given wrapper. Extraction models 
are essentially collections of extraction events. An extraction event models the 
extraction of a subsequence by means of an extraction rule which is applied to 
a context, that is a specific sequence of nodes. However, not all the extraction 
events turn out to be useful for the construction of the XML document dedicated 
to contain the extracted information: extraction models are able to identify those 
events that can be profitably exploited in building an XML document.

4.1 Extraction Events and Models

The notion of extraction model relies strictly on the notion of extraction event.
An extraction event happens whenever an extraction rule is applied. We assume
that each extraction event is associated with a unique identifier.

Definition 7 (Extraction event). Given a target element name e_t and an
associated node sequence s_t, an extraction event ε is a tuple ε = (pid, id, e_t, s_t, pos),
where id and pid denote the identifiers of the current and parent extraction event,
respectively, and pos denotes the position of ε relative to event pid. □

In order to build an XML document to be extracted by a wrapper, we have to 
consider sets of extraction events. However, only some sets of extraction events 
correspond to a valid document. Therefore, we have to carefully characterize 
such sets of extraction events. To this purpose, let us introduce some preliminary 
definitions on properties of sets of extraction events. 

To begin with, a set ℰ of extraction events is said to be well-formed if the
following conditions hold:

— there do not exist two events (pid, id, e_t, s_t, pos) and (pid', id, e'_t, s'_t, pos') in ℰ
such that pid ≠ pid' ∨ e_t ≠ e'_t ∨ s_t ≠ s'_t ∨ pos ≠ pos', i.e. an extraction event
must have a unique identifier;

— there do not exist two events (pid, id, e_t, s_t, pos) and (pid, id', e'_t, s'_t, pos) such
that id ≠ id', i.e. two sibling events cannot refer to the same position;

— there do not exist two events (pid, id, e_t, s_t, pos) and (pid, id', e_t, s_t, pos') such
that id ≠ id', i.e. two identical node sequences cannot be associated to the
same element.
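
The three well-formedness conditions translate directly into pairwise checks, as in the following Python sketch. The `Event` tuple with fields (pid, id, element, nodes, pos) mirrors Definition 7; the field names and the function `is_well_formed` are assumptions of this example.

```python
from collections import namedtuple
from itertools import combinations

# One extraction event: (parent id, own id, element name, node sequence, position).
Event = namedtuple("Event", "pid id element nodes pos")


def is_well_formed(events):
    """Check the three well-formedness conditions on a collection of events."""
    for a, b in combinations(events, 2):
        # 1) an identifier cannot be shared by two different events
        if a.id == b.id and a != b:
            return False
        # 2) two sibling events cannot refer to the same position
        if a.pid == b.pid and a.pos == b.pos and a.id != b.id:
            return False
        # 3) the same node sequence cannot be associated twice with the same
        #    element under the same parent
        if (a.pid == b.pid and a.element == b.element
                and a.nodes == b.nodes and a.id != b.id):
            return False
    return True
```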

Notations for handling well-formed sets of extraction events are introduced
next. Given a set ℰ of extraction events and a specific event ε_p ∈ ℰ identified
by pid, we denote with ℰ(pid) ⊆ ℰ the set containing all the extraction events
which are children of ε_p, i.e. ℰ(pid) = {ε | ε = (pid, id, e_t, s_t, pos) ∈ ℰ}. We
further describe two simple functions, namely elnames and linearize, that
provide flat versions of a set of extraction events. Given an event identifier pid
and a set ℰ of extraction events, we denote with linearize(ℰ(pid)) the list of
extraction events in ℰ(pid) such that the events are ordered by position.
Moreover, we denote with elnames(ℰ(pid)) the sequence of element names
corresponding to linearize(ℰ(pid)): formally, elnames(ℰ(pid)) = [e^0, ..., e^k], where
linearize(ℰ(pid)) = [(pid, id^0, e^0, s^0, pos^0), ..., (pid, id^k, e^k, s^k, pos^k)].

Extraction events need to be characterized with respect to their conformance
to a given regular expression specifying an element type. Given a regular
expression α on an alphabet of element names, and an event identifier pid, we say
that ℰ(pid) is valid for α if elnames(ℰ(pid)) spells α, i.e. the string formed by
concatenating element names in elnames(ℰ(pid)) belongs to the language L(α).
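
A minimal Python sketch of linearize, elnames, and the validity test is given below. The `Event` tuple, the helper `children`, and the encoding of the regular expression α as a Python regex over space-terminated element names are assumptions made for illustration.

```python
import re
from collections import namedtuple

# One extraction event: (parent id, own id, element name, node sequence, position).
Event = namedtuple("Event", "pid id element nodes pos")


def children(events, pid):
    """All events whose parent identifier is `pid`."""
    return [e for e in events if e.pid == pid]


def linearize(child_events):
    """Child events ordered by position."""
    return sorted(child_events, key=lambda e: e.pos)


def elnames(child_events):
    """Element names of the child events, in position order."""
    return [e.element for e in linearize(child_events)]


def is_valid_for(child_events, pattern):
    """True if the element names spell a word of the language of `pattern`."""
    word = "".join(name + " " for name in elnames(child_events))
    return re.fullmatch(pattern, word) is not None
```

With the content model of book written as `r"title (author )+(customer_rate|no_rate) price year "`, a set of child events is valid exactly when its names, read in position order, form one of the admitted sequences.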

We are now able to characterize the validity of a set of extraction events
with respect to the definition of an element. Let D = (El, P, e_r) be a DTD and
W_R = (D, w) be a wrapper. We say that a well-formed set ℰ of extraction events
is valid for an element name e ∈ El if the following conditions hold:

— P(e) = EMPTY, or P(e) = #PCDATA, or

— for each extraction event (ppid, pid, e, s, pos) ∈ ℰ:

  • ℰ(pid) is valid for P(e), and

  • for each event (pid, id, e_t, s_t, pos) ∈ ℰ(pid), s_t ∈ w(e, e_t)(s), and

  • there do not exist two extraction events (pid, id, e_t, s_t, pos) and (pid, id', e_t,
    s'_t, pos') in ℰ(pid) such that pos > pos' and s_t does not precede s'_t in
    w(e, e_t)(s), and

  • ℰ(pid) contains k extraction events such that linearize(ℰ(pid)) = [(pid,
    id^0, e^0, s^0, pos^0), ..., (pid, id^k, e^k, s^k, pos^k)], pos^i < pos^{i+1}, i ∈ [0, k − 1].

Fig. 2. Sketch of HTML parse tree of page in Fig. 1

An extraction model is essentially a well-formed set of extraction events that 
conform to the definition of all the elements appearing in the DTD specified 
within a wrapper. Moreover, an extraction model can be represented by a tree 
of extraction events. 

Definition 8 (Extraction Model). Let D = (El, P, e_r) be a DTD, W_R =
(D, w) be a wrapper, doc be an XHTML document, and ℰ be a well-formed set
of extraction events. ℰ is said to be an extraction model of doc with respect to
W_R (for short, ℰ is an extraction model of W_R(doc)) if:

— ℰ corresponds to a tree T = (r, N, E, λ), where r = (0, 0, e_r, [0], 0) ∈ ℰ, N is
the set of extraction events, E is formed by pairs (ε_i, ε_j) such that ε_i, ε_j ∈ ℰ
and ε_i is the parent event of ε_j, and λ is a function associating an identifier
to each extraction event;

— for each extraction event (pid, id, e_t, s_t, pos) ∈ ℰ, ℰ(id) is valid for e_t. □

Example 2. Consider again the Amazon page displayed in Fig. 1, and suppose 
that such a page is subject to a wrapper based on the DTD presented in the 
Introduction. The extraction rules used by this wrapper are reported in the third 
column of Table 1; we assume that (1,1) and minimize are adopted as default 
external filters. The first column reports the target element names associated 
with each rule, whereas the parent element names can be deduced from the DTD. 
Extraction events occurring in the example model are reported in the second 
column of the table. For the sake of simplicity, we focus only on a portion of the 
document doc corresponding to the page of Fig. 1; the parse tree associated with 
doc is sketched in Fig. 2. Therefore, we consider only some events, according to 
the portion of the page we have chosen. Event ε_0 = (0, 0, doc, [0], 0) occurs implicitly 
in the model under consideration, thus it is not extracted by any rule. 

Offered books are stored in a single table, which is extracted by event ε_1 
using filter f_store. This filter fulfills the requirement that the book table has to 
be preceded by a simpler table containing a selection list. Information about 
any book is stored in a separate table which consists of two parts: the first one 
contains a book picture, while the second one is another table divided into eight 
rows, one for each specific piece of information about the book. 

Let us consider the first instance of book, whose subtree is rooted in node 25 
of the parse tree. The book, which is identified by event ε_2 using filter f_book, 
has information on title, (one) author, year, customer rate, and price. The set 
of events which are children of ε_2 is built as ℰ(2) = {ε_3, ε_4, ε_5, ε_6, ε_8, ε_9}. Even 
though information on customer rate is available from the first instance of book, 
we can observe that event ε_8 happens for element no_rate: however, such an 
event cannot appear in the model, because ℰ(2) would not be a valid content for 
an element of type book. 

It is worth noting that rules for extracting information on both availability and 
unavailability of customer rate have been intentionally defined as identical in 
this example. However, both kinds of extraction events occur only for books 
that have a customer rate, while only the event for element no_rate is extracted 
from books that have no customer rate. This happens because an event for rate 
cannot occur as a child of an event for no_rate. 

An extraction model is implicitly associated with a unique XML document, 
which is valid with respect to a previously specified schema. Given a DTD 
D = (El, P, e_r), a wrapper WR = (D, w), an XHTML document doc, and an 
extraction model ℰ of WR(doc), we define the function buildDoc which takes ℰ 
and an event ε ∈ ℰ as input and returns the XML fragment relative to ε. For 
any event ε = (pid, id, e, s, pos), buildDoc(ℰ, ε) is recursively defined as follows: 

— if P(e) = EMPTY then buildDoc(ℰ, ε) = <e/>; 

— if P(e) = #PCDATA then buildDoc(ℰ, ε) = <e>text(s)</e>; 

— if P(e) is a regular expression then buildDoc(ℰ, ε) = <e>buildDoc(ℰ, ε_1) + 
... + buildDoc(ℰ, ε_k)</e>, where linearize(ℰ(id)) = [ε_1, ..., ε_k]. 

In the above definitions, text(s) denotes the concatenation of the string values 
of the nodes in s, and the symbol "+" is used to indicate the concatenation of strings. 
Moreover, we denote with buildDoc(ℰ) the application of buildDoc to the root 
extraction event in ℰ. 
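
The recursive definition of buildDoc can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the paper: an event is a small record, the node sequence s is represented as a tuple of strings (so text(s) is a plain join), the content model P is a dictionary mapping an element name to "EMPTY", "#PCDATA" or any other marker standing for a regular expression, and linearize simply sorts sibling events by position. The names Event, children and build_doc are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    pid: int    # identifier of the parent event
    id: int     # identifier of this event
    e: str      # element name
    s: tuple    # extracted node sequence, simplified here to a tuple of strings
    pos: int    # position among the sibling events


def children(model, parent_id):
    """Stand-in for linearize(E(id)): child events ordered by position."""
    kids = [ev for ev in model if ev.pid == parent_id and ev.id != parent_id]
    return sorted(kids, key=lambda ev: ev.pos)


def build_doc(model, content, ev):
    """Return the XML fragment for event ev, following the three cases."""
    kind = content[ev.e]
    if kind == "EMPTY":
        return f"<{ev.e}/>"
    if kind == "#PCDATA":
        return f"<{ev.e}>{''.join(ev.s)}</{ev.e}>"
    # Any other content model: concatenate the fragments of the children.
    inner = "".join(build_doc(model, content, c) for c in children(model, ev.id))
    return f"<{ev.e}>{inner}</{ev.e}>"
```

Calling build_doc on the root event (the one with pid = id = 0) plays the role of buildDoc(ℰ). The sketch does not check validity of the model; it assumes the events passed in already form an extraction model.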

Definition 9 (Extracted XML document). Given a wrapper WR = (D, w) 
and an XHTML document doc, an XML document xdoc is extracted from doc 
by applying WR if there exists an extraction model ℰ of WR(doc) such that 
xdoc = buildDoc(ℰ). 

Moreover, we denote with XDoc(WR(doc)) the set of all the XML documents 
xdoc that are extracted from doc by applying WR. □ 

Table 1. Elements, events and rules in a wrapper for Amazon pages 

element: store 
  events: ε_1 = (0, 1, store, [24], 1) 
  rules:  f_store = ($doc : /table → $store, 
          {$store : preceding-sibling::*[1]//select → $list}) 

element: book 
  events: ε_2 = (1, 2, book, [30], 1) 
          ε_10 = (1, 10, book, [68], 2) 
          ε_17 = (1, 17, book, [106], 3) 
  rules:  f_book = ($store : /tr/table → $book, 
          {$book : preceding-sibling::*[1]//img → $image}) 

element: title 
  events: ε_3 = (2, 3, title, [32], 1) 
          ε_11 = (10, 11, title, [70], 1) 
  rules:  f_title = ($book : /tr/td → $title, 
          {$title : //a → $anchor_text}) 

element: author 
  events: ε_4 = (2, 4, author, [36], 2) 
          ε_12 = (10, 12, title, [74], 2) 
  rules:  f_author = ($book : /tr/td → $author, 
          {$author : .[contains(content.text(), 'author') 
          or contains(content.text(), 'editor')] → $v}) 
          r_author = ({f_author}, (V =), maximize) 

element: year 
  events: ε_5 = (2, 5, year, [38], 3) 
          ε_13 = (10, 13, year, [76], 3) 
  rules:  f_year = ($book : /tr/td → $year, 
          {$year : .[contains(content.text(), 'year')] → $y}) 

element: customer_rate 
  events: ε_6 = (2, 6, customer_rate, [43], 4) 
          ε_21 = (17, 21, customer_rate, [120], 4) 
  rules:  f_crate = ($book : /tr[3]/td → $customer_rate) 

element: rate 
  events: ε_7 = (6, 7, rate, [44], 1) 
          ε_22 = (21, 22, rate, [121], 1) 
  rules:  f_rate = ($customer_rate : /img → $rate) 

element: no_rate 
  events: ε_8 = (2, 8, no_rate, [43], 5) 
          ε_15 = (10, 15, rate, [120], 5) 
  rules:  f_norate = ($book : /tr[3]/td → $no_rate) 

element: price 
  events: ε_9 = (2, 9, price, [61], 6) 
          ε_16 = (10, 16, year, [99], 6) 
  rules:  f_price = ($book : /tr/td → $price, 
          {$price : .[contains(content.text(), 'Buy new')] → $p}) 

Theorem 1. Let WR = (D, w) be a wrapper and doc be an XHTML document. 
If D is not recursive and all the extraction rules in WR are monotonic then: 

1. each extraction model ℰ of WR(doc) is finite, and the cardinality of ℰ is 
bounded by a polynomial with respect to the size of doc; 

2. the set XDoc(WR(doc)) is finite. □ 

4.2 Preferred Extraction Models 

Extraction models provide us with a characterization of the set of XML documents 
that encode the information extracted by a wrapper WR from a given XHTML 
document doc, i.e. the set XDoc(WR(doc)). Each document in this set represents 
a candidate result of the application of WR to doc. However, having several 
candidate results is not a desirable property for a wrapping framework. 

In this section we investigate the requirements to identify a unique document 
which is preferred with respect to all the candidate XML extracted documents. 
Firstly, we introduce an order relation between sets of extraction events having 
the same parent element type. Consider two extraction models ℰ_1 and ℰ_2 of 
WR(doc), and two events (ppid_1, pid_1, e, s, pos) ∈ ℰ_1 and (ppid_2, pid_2, e, s, pos) ∈ 
ℰ_2. We say that ℰ_1(pid_1) precedes ℰ_2(pid_2) (hereinafter referred to as ℰ_1(pid_1) ≺ 
ℰ_2(pid_2)) if one of the following conditions holds: 

— elnames(ℰ_1(pid_1)) precedes elnames(ℰ_2(pid_2)) in the language L(P(e)), or 

— elnames(ℰ_1(pid_1)) is equal to elnames(ℰ_2(pid_2)), and there exists a position 
pos such that 

• for each i < pos, if (pid_1, id_1, e_1, s_1, i) ∈ ℰ_1 and (pid_2, id_2, e_2, s_2, i) ∈ ℰ_2 
then e_1 = e_2 and s_1 = s_2, and 

• if (pid_1, id_1, e_1, s_1, pos) ∈ ℰ_1 and (pid_2, id_2, e_2, s_2, pos) ∈ ℰ_2 then s_1 
precedes s_2 in w(e, e_1)(s), or 

— elnames(ℰ_1(pid_1)) is equal to elnames(ℰ_2(pid_2)) and there exists a position 
pos such that 

• for each i < pos, if (pid_1, id_1, e_1, s_1, i) ∈ ℰ_1 and (pid_2, id_2, e_2, s_2, i) ∈ ℰ_2 
then e_1 = e_2 and s_1 = s_2, and ℰ_1(id_1) ⊀ ℰ_2(id_2) and ℰ_2(id_2) ⊀ ℰ_1(id_1), and 

• if (pid_1, id_1, e_1, s_1, pos) ∈ ℰ_1 and (pid_2, id_2, e_1, s_1, pos) ∈ ℰ_2 then ℰ_1(id_1) 
≺ ℰ_2(id_2). 

The above order relation allows us to define an order relation between sets of 
extraction events, and consequently between extracted documents. Given two 
extraction models ℰ_1 and ℰ_2 of WR(doc), we have that ℰ_1 precedes ℰ_2 (ℰ_1 ≺ ℰ_2) if 
ℰ_1(0) ⪯ ℰ_2(0). Moreover, given two XML documents xdoc_1 and xdoc_2 generated 
from ℰ_1 and ℰ_2, respectively, we say that xdoc_1 precedes xdoc_2 (xdoc_1 ≺ xdoc_2) 
if, for each model ℰ_2 of xdoc_2, there exists a model ℰ_1 of xdoc_1 such that ℰ_1 ⪯ ℰ_2. 

Definition 10 (Preferred extracted document). Let D = (El, P, e_r) be a 
DTD, WR = (D, w) be a wrapper, doc be an XHTML document and xdoc be 
an XML document in XDoc(WR(doc)). xdoc is preferred in XDoc(WR(doc)) 
if, for each document xdoc' ∈ XDoc(WR(doc)), xdoc ⪯ xdoc' holds. □ 

Theorem 2. Let D = (El, P, e_r) be a DTD, WR = (D, w) be a wrapper, 
and doc be an XHTML document. There exists a unique preferred extracted 
document pxdoc in XDoc(WR(doc)). □ 

5 Conclusions and Future Work 

In this work, we posed the theoretical basis for exploiting the schema of the 
information to be extracted in a wrapping process. We provided a clean declar- 
ative semantics for schema-based wrappers, through the definition of extraction 
models for source HTML documents with respect to a given wrapper. We also 
addressed the issue of wrapper evaluation, developing an algorithm which works 
in polynomial time with respect to the size of a source document; the reader is 
referred to [5] for detailed information. 

We are currently developing a system that implements the proposed wrapping 
approach. As ongoing work, we plan to introduce enhancements to extraction 
schema. In particular, we are interested in considering XSchema constraints and 
relaxing the one-unambiguous property. 
Abstract. We address the problem of integrating objects from a source taxonomy 
into a master taxonomy. This problem is not only currently pervasive on 
the web, but also important to the emerging semantic web. A straightforward 
approach to automating this process would be to train a classifier for each category 
in the master taxonomy, and then classify objects from the source taxonomy 
into these categories. Our key insight is that the availability of the source 
taxonomy data could help to build better classifiers in this scenario; it would 
therefore be beneficial to do transductive learning rather than inductive 
learning, i.e., learning to optimize classification performance on a particular set 
of test examples. In this paper, we attempt to use a powerful transductive learning 
algorithm, Spectral Graph Transducer (SGT), to attack this problem. Noticing 
that the categorizations of the master and source taxonomies often have 
some semantic overlap, we propose to further enhance SGT classifiers by 
incorporating the affinity information present in the taxonomy data. Our 
experiments with real-world web data show substantial improvements in the 
performance of taxonomy integration. 



1 Introduction 

A taxonomy, or directory or catalog, is a division of a set of objects (documents, im- 
ages, products, goods, services, etc.) into a set of categories. There are a tremendous 
number of taxonomies on the web, and we often need to integrate objects from a 
source taxonomy into a master taxonomy. 

This problem is currently pervasive on the web, given that many websites are 
aggregators of information from various other websites [1]. A few examples will 
illustrate the scenario. A web marketplace like Amazon¹ may want to combine goods 
from multiple vendors' catalogs into its own. A web portal like NCSTRL² may want 
to combine documents from multiple libraries' directories into its own. A company 
may want to merge its service taxonomy with its partners'. A researcher may want to 
merge his/her bookmark taxonomy with his/her peers'. Singapore-MIT Alliance³, an 
innovative engineering education and research collaboration among MIT, NUS and 
NTU, has a need to integrate the academic resource (courses, seminars, reports, 
software, etc.) taxonomies of these three universities. 

¹ http://www.amazon.com/ 

² http://www.ncstrl.org/ 

This problem is also important to the emerging semantic web [2], where data has 
structures and ontologies describe the semantics of the data, thus better enabling com- 
puters and people to work in cooperation. On the semantic web, data often come from 
many different ontologies, and information processing across ontologies is not possi- 
ble without knowing the semantic mappings between them. Since taxonomies are 
central components of ontologies, ontology mapping necessarily involves finding the 
correspondences between two taxonomies, which is often based on integrating objects 
from one taxonomy into the other and vice versa [3, 4]. 

If all taxonomy creators and users agreed on a universal standard, taxonomy 
integration would not be so difficult. But the web has evolved without central editorship. 
Hence the correspondences between two taxonomies are inevitably noisy and fuzzy. 
For illustration, consider the taxonomies of two web portals, Google⁴ and Yahoo⁵: 
what is "Arts/ Music/ Styles/" in one may be "Entertainment/ Music/ Genres/" in the 
other, category "Computers_and_Internet/ Software/ Freeware" and category 
"Computers/ Open_Source/ Software" have similar contents but show non-trivial 
differences, and so on. It is unclear if a universal standard will appear outside specific 
domains, and even for those domains, there is a need to integrate objects from a legacy 
taxonomy into the standard taxonomy. 

Manual taxonomy integration is tedious, error-prone, and clearly not possible at 
web scale. A straightforward approach to automating this process would be to 
formulate it as a classification problem, which has been well studied in the machine 
learning area [5]. 

Our key insight is that the availability of the source taxonomy data could be helpful 
to build better classifiers in this scenario, therefore it would be beneficial to do trans- 
ductive learning rather than inductive learning, i.e., learning to optimize classification 
performance on a particular set of test examples. In this paper, we attempt to use a 
powerful transductive learning algorithm, Spectral Graph Transducer (SGT) [6], to 
attack this problem. Noticing that the categorizations of the master and source tax- 
onomies often have some semantic overlap, we propose to further enhance SGT clas- 
sifiers by incorporating the affinity information present in the taxonomy data. Our 
experiments with real-world web data show substantial improvements in the perform- 
ance of taxonomy integration. 

The rest of this paper is organized as follows. In §2, we review the related work. In 
§3, we give the formal problem statement. In §4, we present our approach in detail. In 
§5, we conduct experimental evaluations. In §6, we make concluding remarks. 

2 Related Work 

Most of the recent research efforts related to taxonomy integration are in the context 
of ontology mapping on the semantic web. An ontology specifies a conceptualization of a 
domain in terms of concepts, attributes, and relations [7]. The concepts in an ontology 
are usually organized into a taxonomy: each concept is represented by a category and 
associated with a set of objects (called the extension of that concept). The basic goal 
of ontology mapping is to identify (typically one-to-one) semantic correspondences 
between the taxonomies of two given ontologies: for each concept (category) in one 
taxonomy, find the most similar concept (category) in the other taxonomy. Many 
works in this field use a variety of heuristics to find mappings [8-11]. Recently, 
machine learning techniques have been introduced to further automate the ontology 
mapping process [3, 4, 12-14]. Some of them derive similarities between concepts 
(categories) based on their extensions (objects) [3, 4, 12]; therefore they need to first 
integrate objects from one taxonomy into the other and vice versa (i.e., taxonomy 
integration). So our work can be utilized as a basic component of an ontology 
mapping system. 

³ http://web.mit.edu/sma/ 

⁴ http://www.google.com/ 

⁵ http://www.yahoo.com/ 

As explained later in §3, taxonomy integration can be formulated as a classification 
problem. The Rocchio algorithm [15, 16] has been applied to this problem in [3], and 
the Naive Bayes (NB) algorithm [5] has been applied to this problem in [4], without 
exploiting information in the source taxonomy. 

In [1], Agrawal and Srikant proposed the Enhanced Naive Bayes (ENB) approach 
to taxonomy integration, which enhances the Naive Bayes (NB) algorithm [5]. In 
[17], Zhang and Lee proposed the CS-TSVM approach to taxonomy integration, 
which enhances the Transductive Support Vector Machine (TSVM) algorithm [18] by 
the distance-based Cluster Shrinkage (CS) technique. They later proposed another 
approach in [19], CB-AB, which enhances the AdaBoost algorithm [20-22] by the 
Co-Bootstrapping (CB) technique. In [23], Sarawagi, Chakrabarti and Godboley in- 
dependently proposed the Co-Bootstrapping technique (which they named Cross- 
Training) to enhance the Support Vector Machine (SVM) [24, 25] for taxonomy inte- 
gration, as well as an Expectation Maximization (EM) based approach EM2D (2- 
Dimensional Expectation Maximization). 

This paper is actually a straightforward extension of [17]. Basically, the approach 
proposed in this paper is similar to ENB [1] and CS-TSVM [17], in the sense that they 
are all motivated by the same idea: to bias the learning algorithm against splitting 
source categories. In this paper, we compare these two state-of-the-art approaches 
with ours both analytically and empirically. Comparisons with other approaches are 
left for future work. 

3 Problem Statement 

Taxonomies are often organized as hierarchies. In this work, we assume for simplicity 
that any object assigned to an interior node really belongs to a leaf node which is 
a descendant of that interior node. Since we now have all objects only at leaf nodes, 
we can flatten the hierarchical taxonomy to a single level and treat it as a set of 
categories [1]. 

Now we formally define the taxonomy integration problem that we are solving. 
Given two taxonomies: 

• a master taxonomy $\mathcal{M}$ with a set of categories $C_1, C_2, \ldots, C_M$ each containing a set 
of objects, and 

• a source taxonomy $\mathcal{N}$ with a set of categories $S_1, S_2, \ldots, S_N$ each containing a set of 
objects, 

we need to find the category in $\mathcal{M}$ for each object in $\mathcal{N}$. 
To formulate taxonomy integration as a classification problem, we take $C_1, C_2, \ldots, C_M$ 
as classes, the objects in $\mathcal{M}$ as training examples, and the objects in $\mathcal{N}$ as test 
examples, so that taxonomy integration can be automatically accomplished by predicting 
the class of each test example. 

It is possible that an object in $\mathcal{N}$ belongs to multiple categories in $\mathcal{M}$. Besides, 
some objects in $\mathcal{N}$ may not fit well in any existing category in $\mathcal{M}$, so users may want 
to have the option to form a new category for them. It is therefore instructive to create 
an ensemble of binary (yes/no) classifiers, one for each category $C$ in $\mathcal{M}$. When 
training the classifier for $C$, an object in $\mathcal{M}$ is labeled as a positive example if it is 
contained in $C$ and as a negative example otherwise. All objects in $\mathcal{N}$ are unlabeled 
and wait to be classified. This is called the "one-vs-rest" ensemble method. 
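
The "one-vs-rest" labelling just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name one_vs_rest_labels and the dictionary-of-sets representation of the master taxonomy are assumptions made here, and None is used as a hypothetical marker for an unlabeled (test) object.

```python
def one_vs_rest_labels(master, source_objects):
    """Build one binary labelling per master category.

    master:         dict mapping a master category name to the set of its objects
    source_objects: iterable of objects from the source taxonomy (test examples)
    Returns a dict: category -> {object: +1, -1, or None for unlabeled}.
    """
    master_objects = set().union(*master.values())
    tasks = {}
    for category, members in master.items():
        # Master objects are positive if the category contains them, else negative.
        labels = {obj: (+1 if obj in members else -1) for obj in master_objects}
        # Source objects stay unlabeled and wait to be classified.
        labels.update({obj: None for obj in source_objects})
        tasks[category] = labels
    return tasks
```

Because each category gets its own binary task, an object can be positive for several categories at once, and an object rejected by every classifier is a candidate for a new category.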



4 Our Approach 

Here we present our approach in detail. In §4.1, we review transductive learning and 
explain why it is suitable for our task. In §4.2, we review the Spectral Graph Transducer 
(SGT). In §4.3, we propose the similarity-based Cluster Shrinkage (CS) technique to 
enhance SGT classifiers. In §4.4, we compare our approach with ENB and 
CS-TSVM. 



4.1 Transductive Learning 

Regular learning algorithms try to induce a general classifying function which has 
high accuracy on the whole distribution of examples. However, this so-called induc- 
tive learning setting is often unnecessarily complex. For the classification problem in 
taxonomy integration situations, the set of test examples to be classified are already 
known to the learning algorithm. In fact, we do not care about the general classifying 
function, but rather attempt to achieve good classification performance on that par- 
ticular set of test examples. This is exactly the goal of transductive learning [26]. 

The transductive learning task is defined on a fixed array of $n$ examples 
$X = (x_1, x_2, \ldots, x_n)$. Each example has a desired classification $Y = (y_1, y_2, \ldots, y_n)$, where 
$y_i \in \{+1, -1\}$ for binary classification. Given the labels for a subset $Y_l \subset [1..n]$ of 
$|Y_l| = l < n$ (training) examples, a transductive learning algorithm attempts to predict 
the labels of the remaining (test) examples in $X$ as accurately as possible. 

Several transductive learning algorithms have been proposed. A famous one is 
Transductive Support Vector Machine (TSVM), which was introduced by [26] and 
later refined by [18, 27] . 

Why can transductive learning algorithms outperform inductive learning algorithms? 
Transductive learning algorithms can observe the examples in the test set and 
potentially exploit structure in their distribution. For example, there usually exists a 
clustering structure of examples: the examples in the same class tend to be close to each 
other in feature space, and this kind of knowledge is helpful to learning, especially when 
there are only a small number of training examples. 
Most machine learning algorithms assume that both the training and test examples 
come from the identical data distribution. This assumption does not necessarily hold 
in the case of taxonomy integration. Intuitively, transductive learning algorithms seem 
to be more robust than inductive learning algorithms to the violation of this 
assumption, since transductive learning algorithms take the test examples into account for 
learning. This interesting issue deserves further study in the future. 



4.2 Spectral Graph Transducer 



Recently, Joachims introduced a new transductive learning method, Spectral Graph 
Transducer (SGT) [6], which can be seen as a transductive version of the k nearest- 
neighbor (kNN) classifier. 

SGT works in three steps. The first step is to build the k nearest-neighbor (kNN) 
graph $G$ on the set of examples $X$. The kNN graph $G$ is similarity-weighted and 
symmetrized: its adjacency matrix is defined as $A = A' + A'^{T}$, where 

$$A'_{ij} = \begin{cases} \dfrac{sim(x_i, x_j)}{\sum_{x_k \in knn(x_i)} sim(x_i, x_k)} & \text{if } x_j \in knn(x_i) \\ 0 & \text{else} \end{cases}$$

The function $sim(\cdot, \cdot)$ can be any reasonable similarity measure. In the following, 
we will use a common similarity function 

$$sim(x_i, x_j) = \cos\theta = \frac{x_i \cdot x_j}{\|x_i\| \, \|x_j\|}$$

where $\theta$ represents the angle between $x_i$ and $x_j$. The second step is to decompose $G$ 
into its spectrum: specifically, compute the 2nd through $(d+1)$-th smallest eigenvalues and 
the corresponding eigenvectors of $G$'s normalized Laplacian $L = B^{-1}(B - A)$, where $B$ is the 
diagonal degree matrix with $B_{ii} = \sum_j A_{ij}$. The third step is to classify the examples. 
Given a set of training labels $Y_l$, SGT makes predictions by solving the following 
optimization problem, which minimizes the normalized graph cut with constraints: 

$$\begin{aligned} \min_{y} \quad & \frac{cut(G^+, G^-)}{|\{i : y_i = +1\}| \, |\{i : y_i = -1\}|} \\ \text{s.t.} \quad & y_i = +1, \ \text{if } i \in Y_l \text{ and positive} \\ & y_i = -1, \ \text{if } i \in Y_l \text{ and negative} \\ & y \in \{+1, -1\}^n \end{aligned}$$

where $G^+$ and $G^-$ denote the set of examples (vertices) with $y_i = +1$ and $y_i = -1$ 
respectively, and the cut-value $cut(G^+, G^-) = \sum_{i \in G^+} \sum_{j \in G^-} A_{ij}$ is the sum of the edge 
weights across the cut (bi-partitioning) defined by $G^+$ and $G^-$. Although this 
optimization problem is known to be NP-hard, there are highly efficient methods based 
on the spectrum of the graph that give a good approximation to the globally optimal 
solution [6]. 
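
The first step, together with the Laplacian used in the second step, can be sketched in pure Python. This is an illustrative sketch rather than Joachims' implementation: the function names are hypothetical, examples are assumed to be dense numeric vectors, and an example is assumed not to count as its own neighbor. The eigen-decomposition itself is left out, since it would normally be delegated to a numerical library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_graph(X, k):
    """Symmetrized, similarity-weighted kNN adjacency matrix A = A' + A'^T."""
    n = len(X)
    sim = [[cosine(X[i], X[j]) for j in range(n)] for i in range(n)]
    a_prime = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # The k most similar examples to x_i, excluding x_i itself (assumption).
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: sim[i][j], reverse=True)[:k]
        total = sum(sim[i][j] for j in neighbours)
        for j in neighbours:
            a_prime[i][j] = sim[i][j] / total if total else 0.0
    return [[a_prime[i][j] + a_prime[j][i] for j in range(n)] for i in range(n)]


def normalized_laplacian(A):
    """L = B^{-1}(B - A), with B the diagonal degree matrix B_ii = sum_j A_ij."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        degree = sum(A[i])
        if degree == 0.0:
            continue  # isolated vertex: leave its row at zero
        for j in range(n):
            b_ij = degree if i == j else 0.0
            L[i][j] = (b_ij - A[i][j]) / degree
    return L
```

Each row of A' sums to one over the k neighbors, so adding the transpose only strengthens edges that are mutual; every row of L sums to zero by construction.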
For example, consider a classification problem with 6 examples $X = (x_1, x_2, x_3, x_4, 
x_5, x_6)$ whose kNN graph $G$ is shown in Figure 1 (adapted from [6]) with line 
thickness indicating edge weight. Given a set of training labels $Y_l = \{1, 6\}$: $y_1 = +1$ and 
$y_6 = -1$, SGT predicts $y_2$ and $y_3$ to be positive and $y_4$ and $y_5$ to 
be negative, because cutting $G$ into $G^+ = \{x_1, x_2, x_3\}$ and $G^- = \{x_4, x_5, x_6\}$ gives the 
minimal normalized cut-value while keeping $x_1 \in G^+$ and $x_6 \in G^-$. 



Fig. 1. SGT does classification through minimizing the normalized graph cuts with constraints 

Unlike most other transductive learning algorithms, SGT does not need any addi- 
tional heuristics to avoid unbalanced splits [6]. Furthermore, since SGT has a mean- 
ingful relaxation that can be solved globally optimally with efficient spectral methods, 
it is more robust and promising than existing methods. 
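
To make the constrained normalized cut concrete, the sketch below solves it by exhaustive search on a tiny graph. This is not how SGT works in practice (SGT uses the spectral relaxation precisely because exhaustive search is exponential); it only illustrates the objective. The function names and the edge weights used in the usage note are hypothetical, chosen to mimic two tight groups of three examples joined by one weak edge.

```python
from itertools import product


def normalized_cut(A, y):
    """cut(G+, G-) divided by |G+| * |G-| for a labelling y of +1 / -1."""
    pos = [i for i, label in enumerate(y) if label == +1]
    neg = [i for i, label in enumerate(y) if label == -1]
    cut = sum(A[i][j] for i in pos for j in neg)
    return cut / (len(pos) * len(neg))


def best_labelling(A, known):
    """Exhaustively minimize the normalized cut subject to the known labels.

    known: dict mapping an example index to its training label (+1 or -1).
    """
    n = len(A)
    free = [i for i in range(n) if i not in known]
    best, best_value = None, float("inf")
    for assignment in product((+1, -1), repeat=len(free)):
        y = [0] * n
        for i, label in known.items():
            y[i] = label
        for i, label in zip(free, assignment):
            y[i] = label
        if +1 not in y or -1 not in y:
            continue  # both sides of the cut must be non-empty
        value = normalized_cut(A, y)
        if value < best_value:
            best, best_value = y, value
    return best, best_value
```

With a 6-vertex graph whose vertices 0-2 and 3-5 are fully connected with weight 1.0 and whose only bridge is an edge of weight 0.1 between vertices 2 and 3, calling best_labelling(A, {0: +1, 5: -1}) separates the two groups, mirroring the partition described for Figure 1.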

4.3 Similarity-Based Cluster Shrinkage 

Applying SGT to taxonomy integration, we can effectively use the objects in $\mathcal{N}$ (test 
examples) to boost classification performance. However, thus far we have completely 
ignored the categorization of $\mathcal{N}$. 

Although $\mathcal{M}$ and $\mathcal{N}$ are usually not identical, their categorizations often have 
some semantic overlap. Therefore the categorization of $\mathcal{N}$ contains valuable implicit 
knowledge about the categorization of $\mathcal{M}$. For example, if two objects belong to the 
same category $S$ in $\mathcal{N}$, they are more likely to belong to the same category $C$ in 
$\mathcal{M}$ than to be assigned to different categories. We hereby propose the 
similarity-based Cluster Shrinkage (CS) technique to further enhance SGT classifiers by 
incorporating the affinity information present in the taxonomy data. 

4.3.1 Algorithm 

Since SGT models the learning problem as a similarity-weighted kNN graph, it offers 
a large degree of flexibility for encoding prior knowledge about the relationship 
between individual examples in the similarity function. Our proposed similarity-based 
CS technique takes all categories as clusters and shrinks them by substituting the 
regular similarity function $sim(\cdot, \cdot)$ with the CS similarity function $\text{cs-sim}(\cdot, \cdot)$. 

Definition 1. The center of a category $S$ is $c = \frac{1}{|S|} \sum_{x \in S} x$. 

Definition 2. The CS similarity function $\text{cs-sim}(\cdot, \cdot)$ for two examples $x_i \in S_i$ and 
$x_j \in S_j$ is defined as $\text{cs-sim}(x_i, x_j) = \gamma \cdot sim(c_i, c_j) + (1 - \gamma) \cdot sim(x_i, x_j)$, $0 \le \gamma \le 1$, where 
$c_i$ and $c_j$ are the centers of $S_i$ and $S_j$ respectively. 

When an example $x$ belongs to multiple categories $S^{(1)}, S^{(2)}, \ldots, S^{(g)}$ whose 
centers are $c^{(1)}, c^{(2)}, \ldots, c^{(g)}$ respectively, its corresponding category center in the above 


We name our approach that uses SGT classifiers enhanced by the similarity-based 
CS technique as CS-SGT. 
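
The CS similarity function can be sketched as below. The sketch assumes that the center of a category is the mean of its member vectors and that sim is cosine similarity; the function names are hypothetical, and the handling of examples that belong to several categories is deliberately left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def center(category):
    """Mean vector of the examples in a category (assumed reading of Definition 1)."""
    return tuple(sum(column) / len(category) for column in zip(*category))


def cs_sim(x_i, x_j, c_i, c_j, gamma):
    """gamma * sim(c_i, c_j) + (1 - gamma) * sim(x_i, x_j), with 0 <= gamma <= 1."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * cosine(c_i, c_j) + (1.0 - gamma) * cosine(x_i, x_j)
```

Setting gamma to 0 recovers the plain similarity, while larger values pull examples of the same category together, which is the "shrinkage" the technique is named after.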

4.3.2 Analysis 

Theorem 1. For any pair of examples $x_i$ and $x_j$ in the same category $S$, 
$\text{cs-sim}(x_i, x_j) \ge sim(x_i, x_j)$. 

Proof: Suppose the center of $S$ is $c$; we get 
$\text{cs-sim}(x_i, x_j) = \gamma \cdot sim(c, c) + (1 - \gamma) \cdot sim(x_i, x_j)$. 

Since $sim(c, c) \ge sim(x_i, x_j)$ (the cosine similarity of a vector with itself is 1, 
its maximum value) and $\gamma \ge 0$, we get 
$\gamma \cdot sim(c, c) \ge \gamma \cdot sim(x_i, x_j)$, therefore 

$\gamma \cdot sim(c, c) + (1 - \gamma) \cdot sim(x_i, x_j) \ge \gamma \cdot sim(x_i, x_j) + (1 - \gamma) \cdot sim(x_i, x_j)$, i.e. 
$\text{cs-sim}(x_i, x_j) \ge sim(x_i, x_j)$. 

From the above theorem, we see that CS-SGT increases the similarity between 
examples that are known to be in the same category, and consequently puts more weight on the 
edge between them in the kNN graph. Since SGT seeks the minimum normalized 
graph cut, a stronger connection among examples in the same category directs SGT to 
avoid splitting that category, in other words, to preserve the original categorization of 
the taxonomies to some degree while doing classification. By substituting the 
regular similarity function with the CS similarity function, the CS-SGT approach can 
not only make effective use of the objects in $\mathcal{N}$ like SGT, but also make effective use 
of the categorization of $\mathcal{N}$. 

The CS similarity function $\text{cs-sim}(x_i, x_j)$ is actually a linear interpolation of 
$sim(x_i, x_j)$ and $sim(c_i, c_j)$. The linear interpolation parameter $0 \le \gamma \le 1$ controls the 
influence of the original categorization on the classification. When $\gamma = 1$, CS-SGT 
classifies all objects belonging to one category in $\mathcal{N}$ as a whole into a specific 
category in $\mathcal{M}$. When $\gamma = 0$, CS-SGT is just the same as SGT. As long as the value of 
$\gamma$ is set appropriately, CS-SGT should never be worse than SGT because it includes 
SGT as a special case. The optimal value of $\gamma$ can be found using a tune set (a set 
of objects whose categories in both taxonomies are known). The tune set can be made 
available via random sampling or active learning, as described in [1]. 
4.4 Comparison with ENB and CS-TSVM 

Both ENB and CS-TSVM outperform conventional machine learning methods in 
taxonomy integration, because they are able to leverage the source taxonomy data to 
improve classification. CS-SGT also follows this idea to enhance SGT for taxonomy 
integration. 

ENB [1] is based on NB [5] which is an inductive learning algorithm. In contrast, 
CS-TSVM is based on TSVM [18] which is a transductive learning algorithm. It has 
been shown that CS-TSVM is more effective than ENB [17] in taxonomy integration. 
However, CS-TSVM is not as efficient as ENB because TSVM runs much slower 
than NB. 

CS-SGT is based on the recently proposed transductive learning algorithm SGT 
[6]. We expect CS-SGT to achieve performance similar to that of CS-TSVM, because in 
theory SGT connects to a simplified version of TSVM, and both of them attempt to 
incorporate the affinity information present in the taxonomy data into learning. This 
has been confirmed by our experiments. On the other hand, CS-SGT is much more 
efficient than CS-TSVM for the following three reasons. 

(1) CS-TSVM is based on TSVM, which uses a computationally expensive greedy search to 
get a locally optimal solution. In contrast, CS-SGT is based on SGT, which uses 
efficient spectral methods to approximate the globally optimal solution. 

(2) CS-TSVM must run SVM first to get a good estimation of the fraction of the 
positive examples in the test set [17] because TSVM requires that fraction to be 
fixed a priori [18]. In contrast, CS-SGT does not need this kind of extra- 
computation due to the merit of SGT in automatically avoiding unbalanced splits 
[6]. 

(3) CS-TSVM requires training a TSVM classifier from scratch for each master 
category, using the "one-vs-rest" ensemble method for multi-class multi-label 
classification (as stated in §3). In contrast, CS-SGT (or SGT) needs to build and 
decompose the kNN graph only once for a specific set of examples (dataset), and hence 
saves a lot of time. It has been observed that construction of the kNN graph is the 
most time-consuming step of SGT, but it can be sped up using appropriate data 
structures like inverted indices or kd-trees [6]. 

The CS-SGT approach’s prominent advantage in efficiency has also been con- 
firmed by our experiments. 

In summary, the CS-SGT approach is able to achieve performance similar to that of 
CS-TSVM in taxonomy integration while being as efficient as ENB. 

5 Experiments 

We conduct experiments with real-world web data, to demonstrate the advantage of 
our proposed CS-SGT approach to taxonomy integration. To facilitate comparison, 
we use exactly the same datasets and experimental setup as [17]. 

5.1 Datasets 

We have collected 5 datasets from Google and Yahoo: Book, Disease, Movie, Music 
and News. One dataset includes the slice of Google’s taxonomy and the slice of Ya- 
hoo’s taxonomy about websites on one specific topic. 
In each slice of taxonomy, we take only the top level directories as categories, e.g., 
the “Movie” slice of Google’s taxonomy has categories like “Action”, “Comedy”, 
“Horror”, etc. 

In each category, we take all items listed on the corresponding directory page and 
its sub-directory pages as its objects. An object (listed item) corresponds to a website 
on the world wide web, which is usually described by its URL, its title, and optionally 
a short annotation about its content. 

The set of objects occurring in both Google and Yahoo covers only a small portion 
(usually less than 10%) of the set of objects occurring in Google or Yahoo alone, 
which suggests the great benefit of automatically integrating them. This observation is 
consistent with [1]. 

The number of categories per object in these datasets is 1.54 on average. This ob- 
servation confirms our previous statement in §3 that an object may belong to multiple 
categories, and justifies our strategy to build a binary classifier for each category in 
the master taxonomy. 

The category distributions in all these datasets are highly skewed. For example, in
Google’s Book taxonomy, the most common category contains 21% of the objects, but
88% of the categories contain less than 3% of the objects and 49% of the categories
contain less than 1% of the objects. In fact, skewed category distributions have been
commonly observed in real-world applications [28].

5.2 Tasks 

For each dataset, we pose 2 symmetric taxonomy integration tasks: G ← Y (integrating
objects from Yahoo into Google) and Y ← G (integrating objects from Google into
Yahoo).

As described in §3, we formulate each task as a classification problem. The objects
in G ∩ Y can be used as test examples, because their categories in both taxonomies are
known to us [1]. We hide the test examples’ master categories but expose their source
categories to the learning algorithm in the training phase, and then compare their hidden
master categories with the predictions of the learning algorithm in the test phase. Suppose
the number of test examples is n. For G ← Y tasks, we randomly sample n
objects from the set G − Y as training examples. For Y ← G tasks, we randomly sample
n objects from the set Y − G as training examples. This is to simulate the common
situation in which the sizes of $\mathcal{M}$ and $\mathcal{N}$ are roughly of the same
magnitude. For each task, we do such random sampling 5 times, and report the
classification performance averaged over these 5 random samplings.

5.3 Features 

For each object, we assume that the title and annotation of its corresponding website
summarize its content. So each object can be considered as a text document composed
of its title and annotation (footnote 6).

6 Note that this differs from [1, 23], which take actual Web pages as objects.

The most commonly used feature extraction technique for text data is to treat a
document as a bag-of-words [18, 25]. For each document d in a collection of documents
D, its bag-of-words is first pre-processed by removal of stop-words and stemming.
Then it is represented as a feature vector $x = (x_1, x_2, \ldots, x_m)$, where $x_i$
indicates the importance weight of term $w_i$ (the i-th distinct word occurring in D).
Following the TF×IDF weighting scheme, we set the value of $x_i$ to the product of
the term frequency $TF(w_i, d)$ and the inverse document frequency $IDF(w_i)$, i.e.,
$x_i = TF(w_i, d) \times IDF(w_i)$. The term frequency $TF(w_i, d)$ is the number of
occurrences of $w_i$ in d. The inverse document frequency is defined as

$$IDF(w_i) = \log\left(\frac{|D|}{DF(w_i)}\right)$$

where $|D|$ is the total number of documents in D, and $DF(w_i)$ is the number of
documents in which $w_i$ occurs. Finally all feature vectors are normalized to have
unit length.
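
The weighting just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tfidf_vectors` is hypothetical, documents are assumed to arrive as token lists that have already been stop-word filtered and stemmed, and the natural logarithm is used because the text does not state the base.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length TF x IDF vector (as a dict) per document.

    docs: list of token lists, assumed already stop-word filtered and stemmed.
    """
    n_docs = len(docs)
    # DF(w): number of documents in which term w occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # TF(w, d): occurrences of w in d
        # Natural log is an assumption; the source only writes "log".
        vec = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0:  # leave all-zero vectors untouched
            vec = {w: v / norm for w, v in vec.items()}
        vectors.append(vec)
    return vectors
```

Note that a term occurring in every document gets IDF 0 and therefore contributes nothing to any vector, which is the intended effect of the scheme.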

5.4 Measures 

As stated in §3, it is natural to accomplish a taxonomy integration task via an
ensemble of binary classifiers, each for one category in $\mathcal{M}$. To measure
classification performance, we use the standard F-score (F1 measure) [15]. The F-score
is defined as the harmonic average of precision (p) and recall (r), $F = 2pr/(p + r)$,
where precision is the proportion of correctly predicted positive examples among all
predicted positive examples, and recall is the proportion of correctly predicted positive
examples among all true positive examples. The F-scores can be computed for the binary
decisions on each individual category first and then averaged over categories. Or they
can be computed globally over all the $M \times n$ binary decisions, where M is the
number of categories in consideration (the number of categories in $\mathcal{M}$) and n
is the number of total test examples (the number of objects in $\mathcal{N}$). The
former way is called macro-averaging and the latter way is called micro-averaging [28].
It is understood that the micro-averaged F-score (miF) tends to be dominated by the
classification performance on common categories, and that the macro-averaged F-score
(maF) is more influenced by the classification performance on rare categories [28].
Since the category distributions are highly skewed (see §5.1), providing both kinds of
scores is more informative than providing either alone.
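
The difference between the two averaging schemes is easy to see in code. The following is a minimal sketch, not the evaluation script used in the experiments; the function names and the representation of labels as Python sets are assumptions made for illustration.

```python
def f1(tp, fp, fn):
    """F = 2pr / (p + r), with 0.0 returned when precision or recall is undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def macro_micro_f1(y_true, y_pred, categories):
    """y_true, y_pred: lists of label sets, one set per test example."""
    per_category = []
    total_tp = total_fp = total_fn = 0
    for c in categories:
        tp = sum(1 for t, p in zip(y_true, y_pred) if c in t and c in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if c not in t and c in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if c in t and c not in p)
        per_category.append(f1(tp, fp, fn))  # one F-score per category
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_category) / len(per_category)  # average of per-category scores
    micro = f1(total_tp, total_fp, total_fn)       # one score over pooled decisions
    return macro, micro
```

With three examples of a common category predicted correctly and one example of a rare category missed, the micro-average stays at 0.75 while the macro-average drops to 3/7, which illustrates why rare categories weigh more heavily on maF.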

5.5 Settings 

We use the SGT software implemented by Joachims (footnote 7) with the following
parameters: “-k 10”, “-d 100”, “-c 1000 -t f -p s”. We set the parameter γ of the CS
similarity function to 0.2. Fine-tuning γ on tuning sets would most likely yield better
results than sticking with a pre-fixed value. In other words, the performance superiority
of CS-SGT is probably under-estimated in our experiments.

5.6 Results 

The experimental results of SGT and CS-SGT are shown in Table 1. CS-SGT achieves
considerably better performance than SGT on every taxonomy integration task.
We think this is because CS-SGT makes effective use of the affinity information
present in the taxonomy data.

7 http://sgt.joachims.org/



Table 1. Experimental Results of SGT and CS-SGT

                        SGT                 CS-SGT
Task     Dataset    maF      miF        maF      miF
G ← Y    Book       0.2191   0.5161     0.3167   0.6502
         Disease    0.4429   0.4639     0.6602   0.7269
         Movie      0.1388   0.3175     0.2976   0.6373
         Music      0.2371   0.3461     0.4148   0.5766
         News       0.2992   0.4916     0.4499   0.6955
Y ← G    Book       0.3608   0.4132     0.4894   0.5844
         Disease    0.4162   0.4222     0.5778   0.7431
         Movie      0.2516   0.3934     0.4162   0.6071
         Music      0.2655   0.2901     0.5464   0.7479
         News       0.3612   0.4698     0.5113   0.6521



In Figures 2 and 3, we compare the experimental results of CS-SGT with those of
ENB and CS-TSVM, which come from [17]. We see that CS-SGT outperforms ENB
consistently and significantly. We also find that CS-SGT’s macro-averaged F-scores
are slightly lower than those of CS-TSVM, and its micro-averaged F-scores are
comparable to those of CS-TSVM. On the other hand, our experiments demonstrated that
CS-SGT was much faster than CS-TSVM: CS-TSVM took about one or two days to
run all the experiments while CS-SGT finished in several hours.




Fig. 2. Comparing the macro-averaged F-scores of ENB, CS-TSVM and CS-SGT 



6 Conclusion 

Our main contribution is to show how the Spectral Graph Transducer (SGT) can be
enhanced for taxonomy integration tasks. We have compared the proposed CS-SGT
approach to taxonomy integration with two existing state-of-the-art approaches, and
demonstrated that CS-SGT is both effective and efficient.

Fig. 3. Comparing the micro-averaged F-scores of ENB, CS-TSVM and CS-SGT



Future work may include: comparing with the approaches in [19, 23], incorporating
commonsense knowledge and domain constraints into the taxonomy integration
process, extending to fully functional ontology mapping systems, and so forth.



Abstract. The k-Nearest-Neighbors (kNN) method for classification is
simple but effective in many cases. The success of kNN in classification
depends on the selection of a “good value” for k. In this paper, we
propose a contextual probability-based classification algorithm (CPC)
which looks at multiple sets of nearest neighbors rather than just one
set of k nearest neighbors for classification, to reduce the bias of k. The
proposed formalism is based on probability, and the idea is to aggregate
the support of multiple neighborhoods for various classes to better
reveal the true class of each new instance. To choose a series of more
relevant neighborhoods for aggregation, three neighborhood selection
methods (distance-based, symmetric-based, and entropy-based) are
proposed and evaluated. The experimental results show that CPC obtains
better classification accuracy than kNN and is indeed less biased by k
after saturation is reached. Moreover, the entropy-based CPC obtains the
best performance among the three proposed neighborhood selection methods.



1 Introduction 

kNN is a simple but effective method for classification [1]. For an instance t to be
classified, its k nearest neighbors are retrieved, and these form a neighborhood of
t. Majority voting among the instances in the neighborhood is commonly used
to decide the classification for t, with or without distance-based weighting.
Despite its conceptual simplicity, kNN often performs as well as more elaborate
classifiers when applied to non-trivial problems. Over the last 50
years, this simple classification method has been extensively used in a broad
range of applications such as medical diagnosis, text categorization [2], pattern
recognition [3], data mining [4], and e-commerce. However, to apply kNN we
need to choose an appropriate value for k, and the success of classification is
very much dependent on this value. In a sense, kNN is biased by k. There are
many ways of choosing the k value, and a simple one is to run the algorithm many
times with different k values and choose the one with the best performance. But
this is not a pragmatic method in real applications.

In order for kNN to be less dependent on the choice of k, we propose to look at
multiple sets of nearest neighbors rather than just one set of k nearest neighbors.
For an instance t, each neighborhood bears support for different
possible classes. The proposed formalism is based on contextual probability [5],
and the idea is to aggregate the support of multiple sets of nearest neighbors
for various classes to give a more reliable support value, which better reveals
the true class of t. However, in practice the given data set is usually a sample
of the underlying data space, so it is impossible to gather all the neighborhoods to
aggregate the support for classifying a new instance. Moreover, even if
it were possible to gather all the neighborhoods of a given new instance for
classification, the computational cost could be unbearable. In a sense, the classification
accuracy of CPC depends on the given number of chosen neighborhoods. Methods
for selecting the more relevant neighborhoods for aggregation are therefore
important. Having identified these problems
of CPC, we propose three neighborhood selection methods in this paper, aimed
at choosing a set of neighborhoods as informative as possible for classification,
to further improve the classification accuracy of CPC.

The rest of the paper is organized as follows: Section 2 describes the
contextual probability-based classification method. Section 3 introduces the three
neighborhood selection methods: distance-based, symmetric-based, and
entropy-based. The experimental results are described
and discussed in Section 4. Section 5 ends the paper with a summary, touching on
existing problems and further research directions.



2 Contextual Probability-Based Classification 

Let $\Omega$ be a finite set called a frame of discernment. A mass function is
$m : 2^{\Omega} \to [0, 1]$ such that

$$\sum_{X \subseteq \Omega} m(X) = 1 \tag{1}$$

The mass function is interpreted as a representation (or measure) of knowledge
or belief about $\Omega$, and $m(A)$ is interpreted as a degree of support for
$A \subseteq \Omega$ [6, 7]. To extend our knowledge to an event, A, that we cannot
evaluate explicitly for m, we define a new function $G : 2^{\Omega} \to [0, 1]$ such
that for any $A \subseteq \Omega$

$$G(A) = \sum_{X \subseteq \Omega} m(X)\,\frac{|A \cap X|}{|X|} \tag{2}$$

This means that the knowledge of event A may not be known explicitly in the
representation of our knowledge, but we know explicitly some events X that are
related to it (i.e. A overlaps with X, or $A \cap X \neq \emptyset$). Part of the
knowledge about X, $m(X)$, should then be shared by A, and a measure of this part is
$|A \cap X|/|X|$. The mass function can be interpreted in different ways. In order to
solve the aggregation problem, one interpretation is made as follows.

Let S be a finite set of class labels, and $\Omega$ be a finite data set each element
of which has a class label in S. The labelling is denoted by a function
$f : \Omega \to S$ so that for $x \in \Omega$, $f(x)$ is the class label of x.

Consider a class $c \in S$. Let $N = |\Omega|$, $N_c = |\{x \in \Omega : f(x) = c\}|$,
and $M_c = \sum_{X \subseteq \Omega} P(c|X)$. The mass function for c is defined as
$m_c : 2^{\Omega} \to [0, 1]$ such that, for $A \subseteq \Omega$,

$$m_c(A) = \frac{P(c|A)}{\sum_{X \subseteq \Omega} P(c|X)} = \frac{P(c|A)}{M_c} \tag{3}$$

Clearly $\sum_{X \subseteq \Omega} m_c(X) = 1$, and if the distribution over $\Omega$ is
uniform, then $M_c = \frac{N_c}{N}(2^N - 1)$. Based on the mass function, the
aggregation function for c is defined as $G_c : 2^{\Omega} \to [0, 1]$ such that, for
$A \subseteq \Omega$,

$$G_c(A) = \sum_{X \subseteq \Omega} m_c(X)\,\frac{|A \cap X|}{|X|} \tag{4}$$

When A is a singleton, denoted as a, equation (4) can be changed to equation (5).

$$G_c(a) = \sum_{X \subseteq \Omega,\ a \in X} \frac{m_c(X)}{|X|} \tag{5}$$


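If the distribution over $\Omega$ is uniform then, for $a \in \Omega$ and $c \in S$,
$G_c(a)$ can be represented as equation (6).

$$G_c(a) = P(c|a)\,\alpha_c + \beta \tag{6}$$

Let $C_N^n$ represent the number of ways of picking n unordered outcomes from
N possibilities, then,

$$\alpha_c = \frac{1}{M_c}\sum_{i=1}^{N}\frac{1}{i^2}\left(C_{N-1}^{i-1} - C_{N-2}^{i-2}\right)
\quad\text{and}\quad
\beta = \frac{N_c}{M_c}\sum_{i=1}^{N}\frac{1}{i^2}\,C_{N-2}^{i-2}$$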
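
The aggregation in equations (2) and (5) can be checked on a toy frame. The sketch below is illustrative only: the function name `aggregate`, the three-element frame, and the mass values are all made up for the example and do not come from the paper.

```python
def aggregate(mass, event):
    """G(A) = sum over X of m(X) * |A ∩ X| / |X|, as in equation (2).

    mass: dict mapping frozenset (a non-empty subset X of the frame) to m(X);
          subsets that are not listed are taken to have mass 0.
    event: any iterable of frame elements (the event A).
    """
    a = frozenset(event)
    return sum(m * len(a & x) / len(x) for x, m in mass.items())


# Hypothetical mass function on the frame {1, 2, 3}; the masses sum to 1.
toy_mass = {
    frozenset({1, 2}): 0.5,
    frozenset({3}): 0.25,
    frozenset({1, 2, 3}): 0.25,
}

# For a singleton, only the subsets containing it contribute, each with
# weight m(X) / |X|, which is the form of equation (5).
g_single = aggregate(toy_mass, {1})   # 0.5/2 + 0.25/3
g_frame = aggregate(toy_mass, {1, 2, 3})
```

In this toy case the singleton values add up to the value for the whole frame, which reflects how each subset shares its mass evenly among its elements.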



Let t be an instance to be classified. If we know $P(c|t)$ for all $c \in S$ then
we can assign t to the class c that has the largest $P(c|t)$. Since the given data
set is usually a sample of the underlying data space we may never know the
true $P(c|t)$. All we can do is to approximate $P(c|t)$. Equation (6) shows the
relationship between $P(c|t)$ and $G_c(t)$, and the latter can be calculated from some
given events. If the set of events is complete, i.e. $2^{\Omega}$, we can accurately
calculate $G_c(t)$ and hence $P(c|t)$; otherwise, if it is partial, i.e. a subset of
$2^{\Omega}$, $G_c(t)$ is an approximation and so is $P(c|t)$. From equation (5) we
know that the more we know about a the more accurate $G_c(a)$ (and hence $P(c|a)$)
will be. As a result, we can try to gather as many relevant events about a as possible.
In the spirit of kNN we can deem the neighborhood of a as relevant. Therefore we can
take neighborhoods of t as events. But in practice, the more neighborhoods chosen
for classification, the higher the computational cost. With limited computing
time, the choice of the more relevant neighborhoods is non-trivial. This is
one reason that motivated us to seek a series of more relevant neighborhoods
to aggregate the support for classification. Also in the spirit of kNN, for an
instance t to be classified, the closer an instance is to t, the more it contributes
to classifying t. Based on this understanding, for a given
number of neighborhoods (for example, k) chosen for aggregation, we choose a
series of specific neighborhoods that we think are relevant to the instance to
be classified.

Summarizing the above discussion we propose the following procedure for
CPC.

1. Determine N and $N_c$ for every class $c \in S$, and then calculate $\beta$ and $M_c$.
These numbers are valid for any $t \in \Omega$.

2. Select a number of neighborhoods $A_1, A_2, \ldots, A_k$.

3. Calculate $m_c(A_i) = |A_i^c| / (|A_i| \times M_c)$ for all $c \in S$ and
$i = 1, 2, \ldots, k$, where $A_i^c$ is the set of instances in $A_i$ with class label c.

4. Calculate $G_c(t) = \sum_{i=1}^{k} m_c(A_i)/|A_i|$ for every $c \in S$.

5. Calculate $P(c|t)$ for every $c \in S$.

6. Assign t to the class c that has the largest $P(c|t)$.

In its simplest form kNN is majority voting among the k nearest neighbors
of $t \in \Omega$. In our terminology kNN can be described as follows:

Select one neighborhood A of t, calculate $m_c(A) = |\{x \in A : f(x) = c\}|/|A|
= |A^c|/|A|$, then calculate $G_c(t) = m_c(A)/|A| = |A^c|/|A|^2$, and finally
classify t by the largest $G_c(t)$. We can see that kNN considers only one
neighborhood, and it does not take into account the proportion of instances in a class.
In this sense, therefore, kNN is a special case of our classification procedure.
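
The procedure above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' prototype: the function names are hypothetical, neighborhoods are taken to be the 1, 2, ..., k nearest neighbors under Euclidean distance, and classes are ranked directly by a quantity proportional to $G_c(t)$ (using the fact that $M_c$ is proportional to $N_c$ under a uniform distribution) instead of converting to $P(c|t)$ in step 5.

```python
import math
from collections import Counter


def cpc_scores(train, labels, t, k):
    """Score each class for instance t by aggregating k nested neighborhoods.

    Returns a dict class -> sum_i |A_i^c| / (|A_i|^2 * N_c), which equals
    G_c(t) up to a factor shared by all classes (an assumption of this sketch).
    """
    class_sizes = Counter(labels)  # N_c for every class c
    # Training indices sorted by Euclidean distance to t.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], t))
    scores = dict.fromkeys(class_sizes, 0.0)
    for size in range(1, k + 1):  # neighborhood A_size = the size nearest neighbors
        counts = Counter(labels[i] for i in order[:size])
        for c, n_in_class in counts.items():
            scores[c] += n_in_class / (size * size) / class_sizes[c]
    return scores


def cpc_classify(train, labels, t, k):
    """Return the class with the largest aggregated support."""
    scores = cpc_scores(train, labels, t, k)
    return max(scores, key=scores.get)
```

Setting k = 1 reduces the loop to a single neighborhood, which mirrors the remark that plain kNN considers only one neighborhood; the division by $N_c$ is what brings in the proportion of instances in each class.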

3 Neighborhood Selection 

In practice, a given data set is usually a sample of the underlying data space.
It is impossible to gather all the neighborhoods to aggregate the support for
classifying a new instance. Moreover, even if it were possible to gather
all neighborhoods for classification, the computational cost could be unbearable.
Methods for selecting the more relevant neighborhoods for aggregation are
therefore quite important. In this section, we
describe the three proposed neighborhood selection methods (distance-based,
symmetric, and entropy-based), which have been
implemented in our prototype.

3.1 Distance-Based Neighborhood Selection 

For a new instance t to be classified, distance-based neighborhood selection
proceeds by choosing k nearest neighbors with different k as neighborhoods.
One simple way, for example, is to ensure that for each i ($i = 0, 1, \ldots, k-1$) its i
nearest neighbors make up a neighborhood called $A_i$. With this convention, we
have $A_i \subset A_{i+1}$ and $|A_i| + 1 = |A_{i+1}|$, where $i = 0, 1, \ldots, k-2$.
$|A_i|$ represents the number of neighbors within $A_i$. This is the simplest
neighborhood selection method.

Figure 1 demonstrates the first four neighborhoods using the distance-based
neighborhood selection method.

Fig. 1. The first four distance-based neighborhoods around t



3.2 Symmetric-Based Neighborhood Selection 

Let S be a finite set of class labels denoted as $S = \{c_1, c_2, \ldots, c_m\}$ and
$\Omega$ be a finite data set denoted as $\Omega = \{d_1, d_2, \ldots, d_N\}$. Each
instance $d_i$ in $\Omega$, denoted as $d_i = (d_{i1}, d_{i2}, \ldots, d_{in})$, has a
class label in S. The labelling is denoted by a function $f : \Omega \to S$ so that for
$d_i \in \Omega$, $f(d_i)$ is the class label of $d_i$.

Firstly, we project data set $\Omega$ into n-dimensional space. Each instance is
represented as a point in the n-dimensional space. Then we partition the
n-dimensional space into grids. The partitioning process proceeds as follows:

For each dimension of the n-dimensional space, if feature $a_j$ is ordinal, we
partition $4\sigma_j$ into p equal intervals, where $\sigma_j$ is the standard deviation
of the values occurring for the feature $a_j$; p is a parameter whose value is
application dependent. We use the symbol $\Delta_j$ to represent the length of each cell
of feature $a_j$, i.e. $\Delta_j = 4\sigma_j/p$. If feature $a_j$ is nominal, its discrete
values provide a natural partition. At the end of the partitioning process all the
instances in data set $\Omega$ are distributed into the grids.

Assume t is an instance to be classified, denoted as $t = (t_1, t_2, \ldots, t_n)$. The
initial cell location of t, denoted by $G^0$, can be calculated as follows:

- For an ordinal feature $a_j$, cell $G^0$ is represented by the interval
  $[t_j - \Delta_j/2,\ t_j + \Delta_j/2]$;

- For a nominal feature $a_j$, cell $G^0$ is represented by the set $\{t_j\}$.

All the instances covered by cell $G^0$ make up the first neighborhood $A_0$.
Strictly speaking, each cell in the grid, e.g. $G^0$, is a hypertuple. A hypertuple
is a tuple whose entries are sets for nominal features and intervals for ordinal
features, instead of single values [10].

Assume $A_i$ is the i-th neighborhood and $G^i = (g_1^i, g_2^i, \ldots, g_n^i)$ is the
corresponding hypertuple. To generate the next neighborhood $A_{i+1}$, the hypertuple
$G^i$ is expanded in the following way:

- An ordinal feature $a_j$ in $G^i$, which is represented as an interval
  $g_j^i = [g_{j1}^i, g_{j2}^i]$, is expanded to $[g_{j1}^i - \Delta_j,\ g_{j2}^i + \Delta_j]$;

- A nominal feature $a_j$ in $G^i$, which is represented as a set
  $g_j^i = \{g_{j1}^i, g_{j2}^i, \ldots, g_{jq}^i\}$, is expanded to $g_j^i \cup \{x\}$,
  where $x \in dom(a_j)$ and $x \notin g_j^i$.

Here $dom(a_j)$ is the set of all the values of feature $a_j$ that occur in
the training instances. All the instances covered by the newly generated hypertuple
$G^{i+1}$ make up $A_{i+1}$. Figure 2 is an example of three symmetric hypertuples
around t, which are denoted as $G^0$ (blank), $G^1$ (striped), and $G^2$ (wavy)
respectively. In Figure 2, $G^0$ is covered by $G^1$, and both $G^0$ and $G^1$ are
covered by $G^2$. All the instances covered by hypertuples $G^0$ ($G^1$, $G^2$) make up
the neighborhood $A_0$ ($A_1$, $A_2$ respectively), where $A_0 \subset A_1 \subset A_2$.




Fig. 2. Three symmetric neighborhoods 
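
For ordinal features, the symmetric expansion can be sketched as below. This is a minimal illustration rather than the authors' implementation: the function name, the default value of p, and the use of the population standard deviation are assumptions, and nominal features are left out for brevity.

```python
import statistics


def symmetric_neighborhoods(data, t, k, p=4):
    """Return k nested neighborhoods (lists of row indices) around t.

    data: list of tuples of ordinal (numeric) feature values.
    p: number of intervals into which 4 * sigma_j is split (hypothetical default).
    """
    n = len(t)
    # Cell length per feature: Delta_j = 4 * sigma_j / p.
    # Population standard deviation is an assumption; the text does not say which.
    deltas = [4 * statistics.pstdev([row[j] for row in data]) / p for j in range(n)]
    # Initial cell G^0: an interval of length Delta_j centred on t_j.
    lows = [t[j] - deltas[j] / 2 for j in range(n)]
    highs = [t[j] + deltas[j] / 2 for j in range(n)]
    neighborhoods = []
    for _ in range(k):
        covered = [
            i for i, row in enumerate(data)
            if all(lows[j] <= row[j] <= highs[j] for j in range(n))
        ]
        neighborhoods.append(covered)
        # Expand every interval by Delta_j on both sides to obtain the next cell.
        lows = [lo - d for lo, d in zip(lows, deltas)]
        highs = [hi + d for hi, d in zip(highs, deltas)]
    return neighborhoods
```

Because each interval only ever grows, every neighborhood returned contains the previous one, matching the nesting $A_0 \subset A_1 \subset A_2$ shown in Figure 2 (with equality possible when an expansion covers no new instances).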



3.3 Entropy-Based Neighborhood Selection 

We propose an entropy-based neighborhood selection method that selects a
given number of neighborhoods carrying as much information for classification as
possible. Our goal is to improve the classification accuracy of CPC. This is a
neighborhood-expansion method in which the next neighborhood is generated
by expanding the previous one, so each earlier neighborhood is covered by the later
ones. In each neighborhood expansion step, we calculate the entropy of each
possible expansion (candidate) and select the one with minimal entropy as our
next neighborhood. The smaller the entropy of a neighborhood, the more
imbalanced the class distribution of its neighbors, and the more relevant
the neighbors are to the instance to be classified.

Assume $A_i$ is the i-th neighborhood and $G^i = (g_1^i, g_2^i, \ldots, g_n^i)$ is the
corresponding hypertuple in n-dimensional space. Consider feature $a_j$. If it is
ordinal, then $g_j^i$ is an interval denoted as $g_j^i = [g_{j1}^i, g_{j2}^i]$. The set of
all the instances covered by hypertuple
$(g_1^i, \ldots, [g_{j1}^i - \Delta_j,\ g_{j2}^i], \ldots, g_n^i)$ and the set of all the
instances covered by hypertuple
$(g_1^i, \ldots, [g_{j1}^i,\ g_{j2}^i + \Delta_j], \ldots, g_n^i)$ will be two candidates for
the next neighborhood selection. If feature $a_j$ is nominal, $g_j^i$ is a set denoted
as $g_j^i = \{g_{j1}^i, g_{j2}^i, \ldots, g_{jq}^i\}$. For every value $x \in dom(a_j)$,
where $x \notin g_j^i$, the set of all the instances covered by hypertuple
$(g_1^i, \ldots, g_j^i \cup \{x\}, \ldots, g_n^i)$ will be
a candidate for the next neighborhood selection. We then calculate each
candidate’s entropy according to equation (7), and choose the candidate with minimal
entropy as $G^{i+1}$.

The entropy $E^i$ is defined as follows:

$$E^i = I_{A_i}(c_1, c_2, \ldots, c_m) \tag{7}$$

$$I_{A_i}(c_1, c_2, \ldots, c_m) = -\sum_{j=1}^{m} p_j \log_2(p_j),
\quad\text{where } p_j = \frac{|\{d_k \mid d_k \in A_i,\ f(d_k) = c_j\}|}{|A_i|} \tag{8}$$




In equation (7), m is the number of classes and $c_1, c_2, \ldots, c_m$ are the class
labels. $p_j$ in equation (8) is determined by counting the number of instances in the
candidate $A_i$ that belong to class $c_j$, and expressing this as a proportion of the
total number of instances in this candidate. All the instances covered by $G^{i+1}$
make up $A_{i+1}$.

Suppose that a new instance $t = (t_1, t_2, \ldots, t_n)$ to be classified initially falls
into a cell of the grid represented as a hypertuple $G^0 = (g_1^0, \ldots, g_n^0)$, i.e. t
is covered by $G^0$. For hypertuple $G^0$, if $t_j$ is ordinal, $g_j^0$ represents an
interval, denoted by $g_j^0 = [g_{j1}^0, g_{j2}^0]$, where $g_{j1}^0 = t_j - \Delta_j/2$
and $g_{j2}^0 = t_j + \Delta_j/2$. Obviously, $t_j$ satisfies
$g_{j1}^0 < t_j < g_{j2}^0$. If $t_j$ is nominal, $g_j^0$ is a set, denoted by
$g_j^0 = \{g_{jq}^0\}$, where $t_j = g_{jq}^0$. All the instances covered by hypertuple
$G^0$ make up a set denoted by $A_0$, which is the first neighborhood of our algorithm.
The detailed entropy-based neighborhood selection algorithm is described as follows:

1. Set $A_0 = \{d_i \mid d_i \text{ is covered by } G^0\}$

2. For i = 1 to k - 1
   {Find the i-th neighborhood $A_i$ with minimal entropy $E^i$ among all the
   candidates expanding from $A_{i-1}$}

Suppose that $A_i'$ and $A_i''$ are two neighborhoods of t having the same amount
of entropy, i.e.

$$I_{A_i'}(c_1, c_2, \ldots, c_m) = I_{A_i''}(c_1, c_2, \ldots, c_m).$$

If $|A_i'| < |A_i''|$, where $|A_i'|$ represents the cardinality of $A_i'$, we believe
that $A_i'$ is more relevant to t than $A_i''$, so in this case we prefer to choose
$A_i'$ as the next neighborhood. Otherwise, we prefer to choose the one with minimal
$E^i$ as the next neighborhood.

According to equation (7), the smaller a neighborhood’s entropy is, the more
imbalanced its class distribution is, and consequently the more information it has
for classification. So, in our algorithm, we adopt equation (7) as the criterion
for neighborhood selection. In each neighborhood expansion step, we select
the candidate with minimal $E^i$ as the next neighborhood.
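
The greedy expansion can be sketched for ordinal features as follows. It is an illustrative reading of the algorithm under stated assumptions: all function names are hypothetical, the cell lengths `deltas` are passed in directly, nominal features are omitted, and candidates that cover no instance are skipped because their entropy is undefined.

```python
import math
from collections import Counter


def entropy(class_labels):
    """Entropy of a non-empty list of class labels: -sum p_j log2 p_j."""
    total = len(class_labels)
    counts = Counter(class_labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def covered(data, lows, highs):
    """Indices of rows lying inside the hypertuple [lows, highs]."""
    return [
        i for i, row in enumerate(data)
        if all(lo <= v <= hi for v, lo, hi in zip(row, lows, highs))
    ]


def entropy_neighborhoods(data, labels, t, deltas, k):
    """Return up to k neighborhoods, each grown from the previous one."""
    lows = [v - d / 2 for v, d in zip(t, deltas)]
    highs = [v + d / 2 for v, d in zip(t, deltas)]
    result = [covered(data, lows, highs)]  # A_0: instances in the initial cell
    for _ in range(k - 1):
        candidates = []
        for j, d in enumerate(deltas):
            for side in ("low", "high"):  # expand one feature in one direction
                lo, hi = list(lows), list(highs)
                if side == "low":
                    lo[j] -= d
                else:
                    hi[j] += d
                idx = covered(data, lo, hi)
                if idx:  # assumption: ignore candidates with no instances
                    e = entropy([labels[i] for i in idx])
                    candidates.append((e, len(idx), lo, hi, idx))
        if not candidates:
            break
        # Minimal entropy first; ties are broken in favour of the smaller set.
        _, _, lows, highs, idx = min(candidates, key=lambda c: (c[0], c[1]))
        result.append(idx)
    return result
```

The sort key `(entropy, size)` encodes the tie-breaking rule stated above: among candidates with equal entropy, the one with fewer instances is preferred.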

To illustrate the method we consider an example here. For simplicity, we
describe our entropy-based neighborhood selection method in 2-dimensional space.

Suppose that an instance x to be classified is located in cell [3, 3] in the leftmost
graph of Figure 3. We collect all the instances that are covered by cell [3, 3]
($G^0$) into a set called $A_0$ as the first neighborhood. Then we try to expand
our cell $G^0$ one step in each of 4 different directions (up, down, left, and right)
and choose the candidate with minimal $E^i$ as the new expanded area,
e.g. $G^1$. Then we look up, down, left, and right again and select a new area
(e.g. $G^2$ in the rightmost graph of Figure 3). All the instances covered by the
expanded area $G^1$ make up the next neighborhood, called $A_1$, and so on. At the
end of the procedure, we obtain a series of neighborhoods, e.g. $A_2, A_3, \ldots$, as
shown in Figure 3 from left to right.

Fig. 3. Neighborhood expansion process (1)

If an instance y to be classified is located in cell [2, 3] in the leftmost graph of
Figure 4, the selection process of three neighborhoods is demonstrated by Figure
4 from left to right.




Fig. 4. Neighborhood expansion process (2) 



4 Experiment and Evaluation 

One motivation of this work is the fact that kNN classification is heavily
dependent on the choice of a ‘good’ value for k. The objective of this paper is
therefore to come up with a method in which this dependence is reduced. A contextual
probability-based classification method is proposed to solve this problem, which
works in the same spirit as kNN but needs more neighborhoods. For simplicity
we refer to our classification procedure presented in Section 2 as nokNN.
To distinguish between the three neighborhood selection methods, we
refer to the distance-based neighborhood selection method as nokNN(d), the symmetric
neighborhood selection method as nokNN(s), and the entropy-based neighborhood
selection method as nokNN(e).

Here we experimentally evaluate the classification procedures of nokNN(d),
nokNN(s), and nokNN(e) with real world data sets in order to verify our
expectations and to see if and how aggregating different neighborhoods improves the
classification accuracy of kNN.

4.1 Data Sets 

In the experiments, we used fifteen public data sets available from the UC Irvine
Machine Learning Repository. General information about these data sets is shown
in Table 1. The data sets are relatively small but scalability is not an issue when
data sets are indexed.

In Table 1, the column headings are as follows: NF - Number of
Features, NN - Number of Nominal features, NO - Number of Ordinal features,
NB - Number of Binary features, NI - Number of Instances, and CD - Class Distribution.

Table 1. Some information about the data sets

Data set      NF   NN   NO   NB   NI    CD
Australian    14    4    6    4   690   383:307
Colic         23   16    7    0   368   232:136
Diabetes       8    0    8    0   768   268:500
Glass          9    0    9    0   214   70:17:76:0:13:9:29
HCleveland    13    3    7    3   303   164:139
Heart         13    3    7    3   270   120:150
Hepatitis     19    6    1   12   155   32:123
Ionosphere    34    0   34    0   351   126:225
Iris           4    0    4    0   150   50:50:50
LiverBupa      6    0    6    0   345   145:200
Sonar         60    0   60    0   208   97:111
Vehicle       18    0   18    0   846   212:217:218:199
Vote          16    0    0   16   435   267:168
Wine          13    0   13    0   178   59:71:48
Zoo           16   16    0    0    90   37:18:3:12:4:7:9




4.2 Experiments 

Experiment 1. kNN and nokNN(d) were implemented in our prototype. In
the experiment, 30 neighborhoods were used for every data set. kNN was
run with the number of neighbors k ranging from 1 to 88 in steps of 3,
and nokNN(d) was run with the number of neighborhoods N ranging from 1
to 30 in steps of 1. Each set of k nearest neighbors (k = 1, 4, ..., 88 for kNN)
makes up a neighborhood. In total there are 30 neighborhoods, corresponding to
k ranging from 1 to 88 in steps of 3.

The comparison of kNN and nokNN(d) in classification accuracy is shown
in Figure 5. Each value on the horizontal axis, e.g. N = i, represents the number of
neighborhoods used for aggregation for nokNN(d) and the i-th neighborhood used
for kNN. The k value for kNN with respect to the i-th neighborhood is 3 × i − 2.

The detailed experimental results for kNN and nokNN(d) are presented in
two separate tables: Table 2 for nokNN(d) and Table 3 for kNN, where N is
varied from 1 to 10 for both kNN and nokNN(d).



Fig. 5. A comparison of nokNN(d) and kNN in average classification accuracy



Table 2. Classification accuracy of nokNN(d) in 10-fold cross validation

Data set     N=1    N=2    N=3    N=4    N=5    N=6    N=7    N=8    N=9    N=10
Australian   79.42  80.22  83.04  84.06  84.64  85.07  85.36  85.22  85.36  85.51
Colic        78.89  78.94  81.94  83.33  83.61  83.39  83.89  83.33  83.05  83.33
Diabetes     70.92  71.42  71.97  72.24  72.63  73.29  73.55  73.68  73.82  74.47
Glass        68.10  68.30  67.62  66.67  68.57  68.57  69.05  69.52  70.00  70.00
HCleveland   78.33  78.45  80.00  81.67  82.67  82.67  82.67  83.00  83.33  83.67
Heart        76.30  77.80  78.15  79.63  81.48  81.85  81.85  82.22  81.48  81.48
Hepatitis    80.67  81.25  83.33  83.33  82.67  82.67  82.00  82.67  81.33  82.00
Ionosphere   87.14  87.14  86.29  85.14  84.86  84.57  84.57  84.57  84.29  84.00
Iris         95.33  95.55  96.00  96.00  96.00  96.00  96.00  96.00  96.00  96.00
LiverBupa    60.00  60.21  63.24  65.29  65.88  66.47  66.76  66.18  65.59  65.59
Sonar        88.00  88.06  87.50  87.50  87.50  88.00  88.00  88.00  88.00  87.00
Vehicle      68.57  68.60  68.57  70.48  70.12  71.07  71.43  71.43  71.31  71.31
Vote         91.30  91.39  91.74  91.74  91.74  91.74  91.74  91.74  91.74  92.17
Wine         95.88  95.88  95.88  95.88  95.88  95.29  94.71  94.71  94.71  94.71
Zoo          96.67  96.67  96.67  96.67  96.67  96.67  96.67  96.67  96.67  96.67
Average      81.03  81.33  82.13  82.64  82.99  83.19  83.22  83.26  83.11  83.19



Table 3. Classification accuracy of kNN in 10-fold cross validation

Data set     N=1    N=2    N=3    N=4    N=5    N=6    N=7    N=8    N=9    N=10
Australian   79.42  84.49  85.22  85.80  85.51  85.80  86.23  86.09  85.51  85.51
Colic        78.89  83.33  82.50  84.44  84.17  85.00  84.44  83.89  84.72  85.00
Diabetes     70.92  73.42  74.34  73.68  74.34  74.21  74.74  74.34  74.61  74.34
Glass        68.10  65.71  66.19  62.86  61.90  60.48  59.05  59.05  59.05  60.00
HCleveland   78.33  79.33  80.33  81.00  78.67  80.33  80.33  80.33  80.33  79.33
Heart        76.30  79.26  80.37  79.63  80.74  78.52  78.15  79.26  79.63  79.63
Hepatitis    80.67  80.67  83.33  83.33  85.33  84.67  85.33  84.00  84.00  84.00
Ionosphere   87.14  86.57  83.14  84.57  84.57  84.29  82.86  83.14  80.86  82.00
Iris         95.33  96.67  96.00  96.00  95.33  95.33  95.33  96.00  96.67  96.00
LiverBupa    60.00  63.53  66.76  63.24  66.18  64.12  66.76  65.29  67.94  66.18
Sonar        88.00  85.00  83.50  83.00  78.50  79.00  75.00  73.50  72.00  76.50
Vehicle      68.57  68.81  69.17  70.95  68.81  68.69  67.86  69.05  68.21  66.67
Vote         91.30  93.04  92.17  92.17  92.17  91.74  91.74  91.30  90.87  91.30
Wine         95.88  95.53  94.12  91.76  94.12  93.53  95.29  95.29  95.29  94.71
Zoo          96.67  95.56  94.44  92.22  91.11  85.56  83.33  83.33  82.22  78.89
Average      81.03  81.93  82.11  81.64  81.43  80.75  80.43  80.26  80.13  80.00



In Table 2, the heading N=i gives the number of neighborhoods used for
aggregation.

In Table 3, the heading N=i denotes the i-th neighborhood used for kNN.
The i-th neighborhood contains 3 × i − 2 neighbors, i.e. k = 3 × i − 2, so
N = 1, 2, ..., 10 corresponds to k = 1, 4, ..., 28.

Fig. 6. Classification accuracy of nokNN(d) testing on Diabetes data set

Fig. 7. Classification accuracy of kNN testing on Diabetes data set



Figure 6 and Figure 7 show the full details of the performance of nokNN(d)
and kNN on the Diabetes data set, where the number of neighborhoods varies
from 1 to 30.

Table 4 gives the worst and best performance of kNN, together with the
corresponding N values, and the performance of nokNN(d) when ten
neighborhoods are used for aggregation. In this experiment, we use the 10-fold
cross validation method for evaluation.

The experimental results show that the performance of kNN varies when
different neighborhoods are used, while the performance of nokNN(d) improves
with an increasing number of neighborhoods but stabilizes after a certain
number of stages (k = 6 in Figure 5). Furthermore, the stabilized performance
of nokNN(d) is comparable (in fact slightly better in our experiment on fifteen
data sets) to the best performance of kNN within 10 neighborhoods.



Experiment 2. In this experiment, our goal is to test whether or not the
entropy-based neighborhood selection method can improve the classification
accuracy of CPC. For each value of N, e.g. N=i, nokNN(e) represents the
average classification accuracy obtained when i neighborhoods are used for
aggregation, and kNN represents the average classification accuracy obtained
when testing on the i-th neighborhood. A comparison of entropy-based nokNN(e)
and kNN with respect to classification accuracy using 10-fold cross validation
is shown in Figure 8.

To further verify our aggregation method, we also implemented a symmetric
neighborhood selection method; see Section 3.2 for more details.

Table 4. A comparison of kNN and nokNN(d) in 10-fold cross validation

Data set     kNN worst case (N=10)   kNN best case (N=3)   nokNN(d) (N=10)
Australian   85.51                   85.22                 85.51
Colic        85.50                   82.50                 83.33
Diabetes     74.34                   74.34                 74.47
Glass        60.00                   66.19                 70.00
HCleveland   79.33                   80.33                 83.67
Heart        79.63                   80.37                 81.48
Hepatitis    84.00                   83.33                 82.00
Ionosphere   82.00                   83.14                 84.00
Iris         96.00                   96.00                 96.00
LiverBupa    66.18                   66.76                 65.59
Sonar        76.50                   83.50                 87.00
Vehicle      66.67                   69.17                 71.31
Vote         91.30                   92.17                 92.17
Wine         94.71                   94.12                 94.71
Zoo          78.89                   94.44                 96.67
Average      80.00                   82.11                 83.19




Fig. 8. A comparison of nokNN(e) and kNN in classification accuracy

Figure 9 shows that similar results are obtained using the symmetric
neighborhood selection method.

Fig. 9. A comparison of nokNN(s) and kNN in classification accuracy

A comparison of entropy-based nokNN(e), symmetric-based nokNN(s), and
distance-based nokNN(d) in classification accuracy is shown in Figure 10.
The entropy-based CPC obtains better classification accuracy than the
symmetric-based CPC and the distance-based CPC, especially when the number of
neighborhoods used for aggregation is relatively small, e.g. k < 10. The
experimental results support our hypotheses: (1) the bias of k can be removed
by CPC, and (2) the entropy-based neighborhood selection method indeed
improves the classification accuracy of CPC.

Fig. 10. A comparison of nokNN(e), nokNN(s), and nokNN(d)

5 Conclusions

In this paper we have discussed the issues related to the kNN method for
classification. In order for kNN to be less dependent on the choice of k, we
looked at multiple sets of nearest neighbors rather than just one set of k
nearest neighbors. A set of neighbors is called a neighborhood. For an
instance t, each neighborhood bears support for different possible classes.
We have presented a novel formalism based on probability to aggregate the
support for various classes to give a more reliable support value, which
better reveals the true class of t. Based on this idea, for the specific
neighborhoods used in kNN, which always surround the instance t to be
classified, we have proposed a contextual probability-based classification
method together with three different neighborhood selection methods. To choose
a given number of neighborhoods with as much information for classification as
possible, the proposed entropy-based neighborhood selection method partitions
a multidimensional data space into grids and, at each step, expands the
neighborhood to the candidate with minimal information entropy among all
candidates in the grid. This method is independent of any "distance metric" or
"similarity metric".

Experiments on some public data sets have shown that using nokNN (whether
nokNN(d), nokNN(s), or nokNN(e)) the classification accuracy increases as the
number of neighborhoods increases, but stabilizes after a small number of
neighborhoods; using kNN, however, the classification performance varies when
different neighborhoods are used. Experiments have also shown that the
stabilized performance of nokNN(d) is comparable to the best performance of
kNN. The comparison of entropy-based, symmetric-based, and distance-based CPC
has shown that the entropy-based CPC obtains the highest classification
accuracy.



Abstract. In this paper, a hybrid learning approach named Flexible
NBTree is proposed. Flexible NBTree uses the Bayes measure $\delta$ to select
the proper test and applies a post-discretization strategy to construct the
decision tree. The final decision tree nodes contain univariate splits as in
regular decision trees, but the leaf nodes contain General Naive Bayes, which
is a variant of the standard Naive Bayesian classifier. Empirical studies on
a set of natural domains show that Flexible NBTree has clear advantages
with respect to generalization ability when compared against its counterpart,
NBTree.

Keywords: Flexible NBTree; Bayes measure $\delta$; General Naive Bayes;
post-discretization



1 Introduction 

Decision-tree-based methods of supervised learning represent one of the most
popular approaches within the AI field for dealing with classification
problems. They have been widely used for years in many domains such as web
mining, data mining, pattern recognition, and signal processing. But the
standard decision tree learning algorithm [1] has difficulty capturing the
relation between continuous-valued data points. Learning from data consisting
of both continuous and nominal variables is therefore a key research issue.

Some researchers indicate that hybrid approaches can take advantage of both
symbolic and connectionist models to handle tough problems. Much research
has addressed the issue of combining decision trees with other learning
algorithms to construct hybrid models. Baldwin et al. [2] used mass assignment
theory to translate attribute values into probability distributions over fuzzy
partitions, then introduced probabilistic fuzzy decision trees in which fuzzy
partitions were used to discretize continuous test universes. Tsang et al. [3]
used a hybrid neural network to refine a fuzzy decision tree and extract a
fuzzy decision tree with parameters, which is equivalent to a set of fuzzy
production rules. Based on variable precision rough set theory, Zhang et
al. [4] introduced a new concept of generalization and employed the variable
precision rough sets (VPRS) model to construct multivariate decision trees.



P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 327-335, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

By redefining the test selection measure, this paper proposes a novel hybrid
approach, Flexible NBTree, which attempts to utilize the advantages of both
decision trees and Naive Bayes. The final classifier resembles Kohavi's NBTree
[5] but differs in two respects: 1. NBTree pre-discretizes the data set by
applying an entropy-based algorithm, whereas Flexible NBTree applies a
post-discretization strategy to construct the decision tree. 2. NBTree uses
standard Naive Bayes at the leaf node to handle pre-discretized and nominal
attributes, whereas Flexible NBTree uses General Naive Bayes (GNB), a variant
of standard Naive Bayes, at the leaf node to handle continuous and nominal
attributes in the subspace.

The remainder of this paper is organized as follows. Sections 2 and 3
introduce the post-discretization strategy and GNB, respectively. Section 4
describes Flexible NBTree in detail. Section 5 presents experimental results
comparing the performance of Flexible NBTree and NBTree. Section 6 summarizes
the paper.

2 The Post-discretization Strategy 

When applying the post-discretization strategy to construct a decision tree,
at each internal node in the tree we first select the test which is the most
useful for improving classification accuracy, and then discretize the selected
test if it is continuous.

2.1 Bayes Measure $\delta$

In this discussion we use capital letters such as $X$, $Y$ for variable names,
and lower-case letters such as $x$, $y$ to denote specific values taken by
those variables. Let $P(\cdot)$ denote a probability and $p(\cdot)$ a
probability density function.

Suppose the training set $T$ consists of predictive attributes
$\{X_1, \dots, X_n\}$ and class attribute $C$. Each attribute $X_i$ is either
continuous or nominal. The aim of decision tree learning is to construct a
tree model which can describe the relationship between the predictive
attributes $\{X_1, \dots, X_n\}$ and class attribute $C$.

Tree Model: $\{X_1, \dots, X_n\} \to C$

That is, the classification accuracy of the tree model on data set $T$ should
be as high as possible. Correspondingly, the Bayes measure $\delta$, which is
introduced in this section as a test selection measure, is also based on this
criterion.

Let $X_i$ represent one of the predictive attributes. According to Bayes'
theorem, if $X_i$ is nominal then:

$$P(c \mid x_i) = \frac{P(c)\,P(x_i \mid c)}{P(x_i)} \propto P(c)\,P(x_i \mid c). \tag{1}$$

Otherwise, if $X_i$ is continuous then:

$$P(c \mid x_i) = \frac{P(c)\,p(x_i \mid c)}{p(x_i)} \propto P(c)\,p(x_i \mid c). \tag{2}$$

The aim of Bayesian classification is to choose the class that maximizes the
posterior probability. When some instances satisfy $X_i = x_i$, their class
labels are most likely to be:

$$c^* = \begin{cases}
\arg\max_{c \in C} P(c)\,P(x_i \mid c) & \text{(if } X_i \text{ is nominal)} \\
\arg\max_{c \in C} P(c)\,p(x_i \mid c) & \text{(if } X_i \text{ is continuous)}
\end{cases} \tag{3}$$



Definition 1. Suppose $X_i$ has $m$ distinct values $x_{i1}, \dots, x_{im}$.
We define the Bayes measure $\delta$ as:

$$\delta = \frac{\sum_{j=1}^{m} \mathrm{Count}(X_i = x_{ij} \wedge C = c^*)}{N}, \tag{4}$$

where $N$ is the size of set $T$ and $c^*$ is the class label inferred from
(3) for the value $x_{ij}$. Intuitively, $\delta$ is the classification
accuracy obtained when the classifier consists of attribute $X_i$ only. It
describes the extent to which the model constructed from attribute $X_i$ fits
class attribute $C$. The predictive attribute which maximizes $\delta$ is the
one that is most useful for improving classification accuracy.
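As an illustration of Definition 1, the following minimal sketch computes $\delta$ for a single nominal attribute. It assumes plain relative-frequency estimates of $P(c)$ and $P(x_{ij} \mid c)$, so that the class chosen by (3) for a value is simply the most frequent class among the instances taking that value; the function name `bayes_measure_nominal` is hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def bayes_measure_nominal(values, labels):
    """Bayes measure delta of one nominal attribute (illustrative sketch).

    Assumption: P(c) and P(x|c) are relative frequencies, so
    P(c) * P(x|c) is proportional to Count(X = x and C = c) and the
    class picked by rule (3) for value x is the majority class among
    the instances with X = x.
    """
    if len(values) != len(labels) or not values:
        raise ValueError("values and labels must be non-empty and aligned")

    # Count class frequencies separately for every attribute value.
    counts_by_value = defaultdict(Counter)
    for x, c in zip(values, labels):
        counts_by_value[x][c] += 1

    # For each value, the instances of the predicted class are the
    # correctly classified ones; sum them and divide by N.
    correct = sum(max(counts.values()) for counts in counts_by_value.values())
    return correct / len(values)
```

For example, with values `a, a, a, b, b, c` and labels `y, y, n, n, n, y`, five of the six instances agree with the majority class of their value, so the measure is 5/6.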



2.2 Discretization of Continuous Tests 

The aim of discretization is to partition the values of a continuous test
$X_i$ into a nominal set of intervals. According to (3), we have:

$$c^* = \arg\max_{c \in C} P(c)\,p(x_i \mid c), \tag{5}$$

where the conditional probability density function $p(x_i \mid c)$ is
continuous. Given arbitrary values $x_a$ and $x_b$ of attribute $X_i$, when
$x_a \to x_b$ we have

$$P(c)\,p(x_a \mid c) \to P(c)\,p(x_b \mid c).$$

So the class labels inferred from (3) will not change within a small interval
of the values of $X_i$. For clarification, suppose the relationship between
the distribution of $X_i$ and $C$ is as shown in Fig. 1.

Fig. 1. The relationship between the distribution of $X_i$ and $C$

We can see from Fig. 1 that

$$\begin{cases}
C = c_1 & (x_a < X_i < x_b \text{ or } x_d < X_i < x_e) \\
C = c_2 & (x_b < X_i < x_c) \\
C = c_3 & (x_c < X_i < x_d)
\end{cases} \tag{6}$$

Note that the values $c_1, c_2, c_3$ are inferred from (3); they are not the
true class labels of the training instances. In the current example there are
three candidate boundaries, corresponding to the values of $X_i$ at which the
value of $C$ changes: $x_b$, $x_c$, $x_d$. If we use these boundaries to
discretize attribute $X_i$, the classification accuracy after discretization
will be equal to $\delta$. So the process of computing $\delta$ is also the
process of discretization. The Bayes measure $\delta$ can thus be used to
automatically find the most appropriate boundaries for discretization and the
number of intervals.

Although this kind of discretization method can retain classification
accuracy, it may lead to too many intervals. The Minimum Description Length
(MDL) principle is used in our experimental study to control the number of
intervals.

Suppose we have sorted the sequence $S$ into ascending order by the values of
$X_i$. Such a sequence is partitioned by a boundary $B$ into two subsets
$S_1$, $S_2$. The class information entropy of the partition, denoted by
$E(X_i, B; S)$, is given by:

$$E(X_i, B; S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2),$$

where $\mathrm{Ent}(\cdot)$ denotes the entropy function,

$$\mathrm{Ent}(S_i) = -\sum_{c \in C} P(c, S_i) \log_2 P(c, S_i),$$

and $P(c, S_i)$ stands for the proportion of the instances in $S_i$ that
belong to class $c$.

According to the MDL principle, the partitioning within $S$ is reasonable iff

$$\mathrm{Gain}(X_i, B; S) > \frac{\log_2(N - 1)}{N} + \frac{\Delta(X_i, B; S)}{N},$$

where $\mathrm{Gain}(X_i, B; S) = \mathrm{Ent}(S) - E(X_i, B; S)$ is the
information gain, which measures the decrease of the weighted average impurity
of the partitions $S_1$, $S_2$ compared with the impurity of the complete set
$S$. $N$ is the number of instances in set $S$,
$\Delta(X_i, B; S) = \log_2(3^k - 2) - [k \cdot \mathrm{Ent}(S) - k_1 \cdot \mathrm{Ent}(S_1) - k_2 \cdot \mathrm{Ent}(S_2)]$,
and $k$ and $k_i$ are the numbers of class labels represented in sets $S$ and
$S_i$, respectively. This approach can then be applied recursively to all
adjacent partitions of attribute $X_i$, thus creating the final intervals.
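The MDL acceptance test above can be sketched in a few lines. The snippet below is an illustrative reading of the criterion, assuming the class labels are already sorted by the value of $X_i$ and that a candidate boundary is given as a list index; the names `entropy` and `mdl_accepts` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy Ent(S) in bits for a non-empty list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mdl_accepts(sorted_labels, cut):
    """Return True if splitting sorted_labels at index `cut` passes the
    MDL criterion Gain > log2(N - 1)/N + Delta/N.

    sorted_labels: class labels ordered by the continuous attribute.
    cut: S1 = sorted_labels[:cut], S2 = sorted_labels[cut:].
    """
    s1, s2 = sorted_labels[:cut], sorted_labels[cut:]
    if not s1 or not s2:
        raise ValueError("both sides of the boundary must be non-empty")

    n = len(sorted_labels)
    ent_s, ent_1, ent_2 = entropy(sorted_labels), entropy(s1), entropy(s2)

    # Weighted class information entropy of the partition, E(X_i, B; S).
    e_partition = len(s1) / n * ent_1 + len(s2) / n * ent_2
    gain = ent_s - e_partition

    # k, k1, k2: number of distinct class labels in S, S1, S2.
    k, k1, k2 = len(set(sorted_labels)), len(set(s1)), len(set(s2))
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * ent_1 - k2 * ent_2)

    return gain > (math.log2(n - 1) + delta) / n
```

A boundary that cleanly separates two classes is accepted, while a boundary that leaves both sides as mixed as the whole set has zero gain and is rejected.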



3 General Naive Bayes (GNB)

Naive Bayes comes originally from work in pattern recognition and is based on
the assumption that the predictive attributes $X_1, \dots, X_n$ are
conditionally independent given the class attribute $C$, which can be
expressed as follows:

$$P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c).$$

But when the instance space contains continuous attributes, the situation is
different. For clarity, we first consider just two attributes: $X_1$
(continuous) and $X_2$ (nominal). Suppose the values of $X_1$ have been
discretized into a set of intervals, each corresponding to a nominal value.
Then the independence assumption should be:

$$P(x_1 \le X_1 \le x_1 + \Delta,\; x_2 \mid c) = P(x_1 \le X_1 \le x_1 + \Delta \mid c)\,P(x_2 \mid c), \tag{7}$$

where $[x_1, x_1 + \Delta]$ is an arbitrary interval of the values of
attribute $X_1$. This assumption, which is the basis of GNB, supports very
efficient algorithms for both classification and learning. By the definition
of a derivative,

$$\begin{aligned}
P(c \mid x_1 \le X_1 \le x_1 + \Delta,\; x_2)
&= \frac{P(c)\,P(x_1 \le X_1 \le x_1 + \Delta \mid c)\,P(x_2 \mid c)}{P(x_1 \le X_1 \le x_1 + \Delta \mid x_2)\,P(x_2)} \\
&= \frac{P(c)\,p(\zeta \mid c)\,\Delta\,P(x_2 \mid c)}{p(\eta \mid x_2)\,\Delta\,P(x_2)} \\
&= \frac{P(c)\,p(\zeta \mid c)\,P(x_2 \mid c)}{p(\eta \mid x_2)\,P(x_2)},
\end{aligned} \tag{8}$$

where $x_1 \le \zeta, \eta \le x_1 + \Delta$. When $\Delta \to 0$,
$P(c \mid x_1 \le X_1 \le x_1 + \Delta, x_2) \to P(c \mid x_1, x_2)$ and
$\zeta, \eta \to x_1$, hence

$$\lim_{\Delta \to 0} P(c \mid x_1 \le X_1 \le x_1 + \Delta,\; x_2) = P(c \mid x_1, x_2) = \frac{P(c)\,p(x_1 \mid c)\,P(x_2 \mid c)}{p(x_1 \mid x_2)\,P(x_2)}. \tag{9}$$

We now extend (9) to handle a much more common situation. Suppose the
first $k$ of $n$ attributes are continuous and the remaining attributes are
nominal. Following a derivation similar to that of (9), we have

$$\begin{aligned}
P(c \mid x_1, \dots, x_n)
&= \frac{P(c) \prod_{i=1}^{k} p(x_i \mid c) \prod_{j=k+1}^{n} P(x_j \mid c)}{p(x_1, \dots, x_k \mid x_{k+1}, \dots, x_n)\,P(x_{k+1}, \dots, x_n)} \\
&\propto P(c) \prod_{i=1}^{k} p(x_i \mid c) \prod_{j=k+1}^{n} P(x_j \mid c).
\end{aligned} \tag{10}$$

Then the classification rule of GNB is:

$$c^* = \arg\max_{c \in C} P(c \mid x_1, \dots, x_n) = \arg\max_{c \in C} P(c) \prod_{i=1}^{k} p(x_i \mid c) \prod_{j=k+1}^{n} P(x_j \mid c). \tag{11}$$

The probabilities $P(c)$ and $P(x_j \mid c)$ in (11) are estimated using the
Laplace estimate and the M-estimate [6], respectively.

Kernel-based density estimation [7] is the most widely used non-parametric
density estimation technique. Compared with parametric density estimation
techniques, it does not make any assumption about the data distribution. In
this paper we choose it to estimate the conditional probability density
$p(x_i \mid c)$ in Eq. (11):

$$\hat{p}(x \mid c) = \frac{1}{nh} \sum_{k=1}^{n} K\!\left(\frac{x - x_k}{h}\right), \tag{12}$$

where $x_k$ ($k = 1, \dots, n$) are the values of attribute $X$ for the
training instances with $C = c$, $K(\cdot)$ is a given kernel function, here
$K(t) = (2\pi)^{-1/2} e^{-t^2/2}$, $h$ is the corresponding kernel width, and
$n$ is the number of training instances with $C = c$.

This estimate converges to the true probability density function if the kernel
function obeys certain smoothness properties and the kernel width is chosen
appropriately. One way of measuring the difference between the true
$p(x_i \mid c)$ and the estimated $\hat{p}(x_i \mid c)$ is the expected
cross-entropy:

$$\mathrm{CE} = -\frac{1}{n} \sum_{j=1}^{n} \log\!\left(\frac{1}{(n-1)h} \sum_{k \ne j} K\!\left(\frac{x_j - x_k}{h}\right)\right),$$

where $h = c_x / \sqrt{n}$ and $c_x$ is chosen to minimize the estimated
cross-entropy. In our experiments, we use an exhaustive grid search where the
grid width is 0.01 and the search is over $c_x \in [0.2, 0.8]$ [8].
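The pieces of this section can be combined into a small sketch: a Gaussian kernel density estimate as in (12), a leave-one-out cross-entropy used to pick $c_x$ by grid search, and the unnormalized GNB score from (11). All function names are hypothetical, the default `c_x=0.5` is only a placeholder inside the stated search range, and the tiny floor inside the logarithm is an added numerical guard rather than part of the paper's method.

```python
import math


def gaussian_kernel(t):
    """K(t) = (2*pi)^(-1/2) * exp(-t^2 / 2)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def kernel_density(x, samples, c_x=0.5):
    """Estimate p(x | c) from the class-c training values, Eq. (12),
    with kernel width h = c_x / sqrt(n)."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one training value")
    h = c_x / math.sqrt(n)
    return sum(gaussian_kernel((x - xk) / h) for xk in samples) / (n * h)


def loo_cross_entropy(samples, c_x):
    """Leave-one-out estimate of the cross-entropy for a given c_x."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two training values")
    h = c_x / math.sqrt(n)
    total = 0.0
    for j, xj in enumerate(samples):
        dens = sum(gaussian_kernel((xj - xk) / h)
                   for k, xk in enumerate(samples) if k != j) / ((n - 1) * h)
        total += math.log(max(dens, 1e-300))  # guard against log(0)
    return -total / n


def choose_c_x(samples, lo=0.2, hi=0.8, step=0.01):
    """Exhaustive grid search for c_x over [lo, hi] with the given width."""
    steps = round((hi - lo) / step)
    grid = [lo + j * step for j in range(steps + 1)]
    return min(grid, key=lambda c: loo_cross_entropy(samples, c))


def gnb_score(prior, continuous, nominal_probs, c_x=0.5):
    """Unnormalized GNB score P(c) * prod p(x_i|c) * prod P(x_j|c).

    continuous: list of (x_i, class-c training values of attribute i).
    nominal_probs: already-estimated P(x_j | c) for the nominal attributes.
    """
    score = prior
    for x, samples in continuous:
        score *= kernel_density(x, samples, c_x)
    for p in nominal_probs:
        score *= p
    return score
```

The class with the largest `gnb_score` is the one selected by rule (11); in practice the score would be accumulated in log space to avoid underflow when many attributes are present.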

4 Flexible NBTree 

Kohavi proposes NBTree as a hybrid approach combining Naive Bayes and
decision trees. It has been shown that NBTree frequently achieves higher
accuracy than either Naive Bayes or a decision tree alone. Like NBTree,
Flexible NBTree also uses a tree structure to split the instance space into
subspaces and generates one Naive Bayes classifier in each subspace. However,
it uses a different discretization strategy and a different version of Naive
Bayes.

The Flexible NBTree learning algorithm is as follows.

Input: a training set $T$ of pre-classified instances.

Output: a hybrid decision tree with GNB at the leaves.

1. From the predictive attribute set $\{X_1, \dots, X_n\}$, select the test
$X_i$ which maximizes $\delta$.

2. If $X_i$ is continuous, partition its values into a set of intervals
according to Subsection 2.2.

3. Partition $T$ according to the value of $X_i$. If $X_i$ is continuous, a
multi-way split is made for all possible nominal intervals; if $X_i$ is
nominal, a multi-way split is made for all possible values.

4. If a descendant node satisfies the stopping criteria, create a GNB as the
leaf node and return. If all instances in a descendant node belong to the same
class, create a class label as the leaf node and return.

5. For each descendant node, the entire process is recursively repeated on the
portion of $T$ that matches the test leading to the node.
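The recursive structure of the steps above can be sketched as follows. This is a deliberately reduced illustration, not the authors' implementation: it handles nominal attributes only (so step 2 is omitted), uses node size as the sole stopping criterion, and stores class counts at a model leaf as a stand-in for a trained GNB. The names `delta` and `build_tree` and the dictionary layout of the tree are hypothetical.

```python
from collections import Counter, defaultdict


def delta(rows, labels, attr):
    """Bayes measure of a nominal attribute under frequency estimates."""
    by_value = defaultdict(Counter)
    for row, c in zip(rows, labels):
        by_value[row[attr]][c] += 1
    return sum(max(cnt.values()) for cnt in by_value.values()) / len(rows)


def build_tree(rows, labels, attrs, min_size=30):
    """Recursively build a simplified Flexible-NBTree-like structure.

    rows: list of dicts mapping attribute name -> nominal value.
    attrs: attribute names still available as tests.
    min_size: nodes with at most this many instances become model leaves
              (simplified stopping criterion, an assumption of this sketch).
    """
    # Step 4 (second case): a pure node becomes a class-label leaf.
    if len(set(labels)) == 1:
        return {"leaf": "label", "label": labels[0]}

    # Step 4 (first case): stopping criterion met, create a model leaf.
    # The class counts stand in for a GNB trained on this subset.
    if len(rows) <= min_size or not attrs:
        return {"leaf": "model", "class_counts": Counter(labels)}

    # Step 1: select the test that maximizes the Bayes measure.
    best = max(attrs, key=lambda a: delta(rows, labels, a))

    # Step 3: multi-way split on every observed value of the test.
    parts = defaultdict(lambda: ([], []))
    for row, c in zip(rows, labels):
        parts[row[best]][0].append(row)
        parts[row[best]][1].append(c)

    # Step 5: recurse on each descendant with the remaining attributes.
    remaining = [a for a in attrs if a != best]
    return {
        "test": best,
        "children": {
            value: build_tree(sub_rows, sub_labels, remaining, min_size)
            for value, (sub_rows, sub_labels) in parts.items()
        },
    }
```

A fuller version would discretize a continuous test with the boundaries of Subsection 2.2 before splitting and would fit the kernel-based GNB of Section 3 at each model leaf.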
Table 1. Average classification accuracy and standard deviation

Data set         NBTree              Flexible NBTree
Anneal           98.3125 ± 1.5284    98.6837 ± 1.2854 √
Audiology        77.3044 ± 7.5284    82.7827 ± 6.8173 √
Australian       85.6257 ± 4.0733    83.2843 ± 4.1834
Breast-w         93.6726 ± 2.6743    94.5274 ± 2.1827 √
German           71.3381 ± 3.2943    75.2732 ± 2.7511 √
Glass            67.6621 ± 9.3321    65.4732 ± 8.7038
Heart-c          76.9315 ± 6.9162    83.1621 ± 6.4923 √
Heart-h          80.2725 ± 8.0183    85.2521 ± 6.3358 √
Ionosphere       89.7836 ± 4.4932    88.1610 ± 3.3275
Iris             94.7162 ± 5.3942    96.3832 ± 4.9832 √
Kr-vs-Kp         99.4035 ± 0.4927    97.4582 ± 0.8368
Pima-indians     74.5120 ± 5.3637    77.1105 ± 4.5374 √
Primary-tumor    41.4158 ± 6.9217    46.7932 ± 6.2033 √
Segment          96.8937 ± 1.6038    95.4382 ± 1.5927
Sick-euthyroid   98.7892 ± 0.6948    96.7042 ± 0.5134
Soybean          91.8824 ± 3.2948    93.5172 ± 2.7283 √
Vehicle          72.3943 ± 4.2036    80.4983 ± 3.4943 √
Zoo              92.6462 ± 7.6494    94.8285 ± 6.7932 √



5 Experiments 

In order to evaluate the performance of Flexible NBTree and compare it against
its counterpart, NBTree, we conducted an empirical study on 18 data sets from
the UCI machine learning repository¹. Each data set consists of a set of
classified instances described in terms of varying numbers of continuous and
nominal attributes. For comparison purposes, the stopping criteria in our
experiments are the same for both algorithms: the relative reduction in error
for a split is less than 5% and there are no more than 30 instances in the
node.

The classification performance was evaluated by ten-fold cross-validation for
all the experiments on each data set. Table 1 shows the classification
accuracy and standard deviation for Flexible NBTree and NBTree, respectively.
'√' indicates that the accuracy of Flexible NBTree is higher than that of
NBTree at a significance level better than 0.05 using a two-tailed pairwise
t-test on the results of the 20 trials in a data set.

Table 1 shows that Flexible NBTree achieves significantly higher accuracy than
NBTree on 12 of the 18 data sets. In order to investigate the reason(s), we
analyze the experimental results on data set Breast-w in particular. Figure 2
shows the comparison of classification accuracy for Flexible NBTree and
NBTree.

When N (the training size of data set Breast-w) < 650, the tree structures
learned by the two algorithms are almost the same. But when N > 650, the
decision node in the second layer of Flexible NBTree contains the univariate
test <Bare Nuclei>, while that learned by NBTree contains the test <Cell
Shape>. Correspondingly, from Fig. 2 we can see that when N = 600 Flexible
NBTree achieves 92.83% accuracy on the test set while NBTree reaches about
92.73%. When N = 650 Flexible NBTree achieves 93.51% accuracy while NBTree
reaches about 92.92%. The relative error reduction increases from 1.38% to
8.33%. We attribute this improvement to the effectiveness of the
post-discretization strategy. Since no information-lossless discretization
procedure is available, some helpful information may be lost in the
transformation from an infinite numeric range to finite subintervals. We
conjecture that pre-discretization does not take full advantage of the
information that continuous attributes supply, and that this may affect the
cutting points of continuous tests or even test selection in the process of
tree construction, thus degrading the classification performance to some
extent. Post-discretization, in contrast, applies discretization only when
necessary, so the possibility of information loss is kept to a minimum.

Fig. 2. Comparison of the classification accuracy

¹ ftp://ftp.ics.uci.edu/pub/machine-learning-databases
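The error-reduction figures quoted for Breast-w follow from the reported accuracies when the reduction is measured relative to NBTree's error rate. A minimal check, with the hypothetical helper name `relative_error_reduction`:

```python
def relative_error_reduction(acc_baseline, acc_new):
    """Relative reduction in error rate, in percent.

    Both arguments are accuracies in percent; the error rate is taken
    as 100 minus the accuracy, and the reduction is expressed relative
    to the baseline error rate.
    """
    err_baseline = 100.0 - acc_baseline
    err_new = 100.0 - acc_new
    return 100.0 * (err_baseline - err_new) / err_baseline


# N = 600: NBTree 92.73%, Flexible NBTree 92.83%
print(round(relative_error_reduction(92.73, 92.83), 2))
# N = 650: NBTree 92.92%, Flexible NBTree 93.51%
print(round(relative_error_reduction(92.92, 93.51), 2))
```

With errors of 7.27% versus 7.17% the reduction is 0.10/7.27, and with 7.08% versus 6.49% it is 0.59/7.08, which match the 1.38% and 8.33% stated in the text.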

6 Summary 

Pre-discretization is a common choice for handling continuous attributes in
machine learning, but the resulting information loss may affect classification
performance negatively. In this paper, we propose a novel learning approach,
Flexible NBTree, which is a hybrid of decision tree and Naive Bayes. Flexible
NBTree applies a post-discretization strategy to mitigate the negative effect
of information loss. Experiments with natural domains showed that Flexible
NBTree generalizes much better than NBTree.



Abstract. Relationships are an essential part of the design of a database
because they capture associations between things. Comparing and integrating
relationships from heterogeneous databases is a difficult problem, partly
because of the nature of the relationship verb phrases. This research proposes
a multi-layered approach to classifying the semantics of relationship verb
phrases to assist in the comparison of relationships. The first layer captures
fundamental, primitive relationships based upon well-known work in data
abstractions and conceptual modeling. The second layer captures the life cycle
of natural progressions in the business world. The third layer reflects the
context-dependent nature of relationships. Use of the classification scheme is
illustrated by comparing relationships from various application domains with
different purposes.



1 Introduction 

Comparing and integrating databases is an important problem, especially in an
increasingly networked world that relies on inter-organizational coordination
and systems. With this comes the need to develop new methods to design and
integrate disparate databases. Database integration, however, is a difficult
problem and one for which semi-automated approaches would be useful. One of
the main difficulties is comparing relationships, because their verb phrases
may be generic or dependent upon the application domain. Being able to compare
the semantics of verb phrases in relationships would greatly facilitate
database design comparisons. It would be even more useful if the comparison
process could be automated. Fully automated techniques, however, are unlikely,
so solutions to integration problems should aid integrators but require
minimal work on their part [Biskup and Embley, 2003]. The objective of this
research is to propose an ontology for understanding the semantics of
relationship verb phrases by mapping the verb phrases to various categories
that capture different interpretations. Doing so requires that a
classification scheme be developed that captures both the domain-dependent and
domain-independent nature of verb phrases. The contribution of this research
is to provide a useful approach to classifying verb phrases so relationships
can be compared in a semi-automated way.

2 Related Work

The design of a database involves representing the universe of discourse in a structure 
in such a way that it accurately reflects reality. Conceptual modeling of databases is, 
therefore, concerned with things (entities) and associations among things (relation- 
ships) [Chen 1993; Wand et al. 1999]. A relationship R, can be expressed as A verb 
phrase B (A vp B), where A and B are entities. Most database design practices use 
simple, binary associations that capture these relationships between entities. A verb 
phrase, which is selected by a designer with the application domain in mind, can 
capture much of the semantics of the relationship. Semantics, for this research, is 
defined as the meaning of a term or a mapping from the real world to a construct. 
Understanding a relationship, therefore, requires that one understand the semantics of 
the accompanying verb phrase. Consider the relationships from two databases: 

Customer (entity) buys (verb) Product (entity) 

Customer (entity) purchases (verb) Product (entity) 

These relationships reflect the same aspect of the universe of discourse, and use syn- 
onymous verb phrases. Therefore, the two relationships may be mapped to a similar 
interpretation, recognized as identical, and integrated. Next, consider: 

Customer reserves Car 
Customer rents Car. 

These relationships reflect different concepts from the universe of discourse. The first 
captures the fact that a customer wants to do something; the second, that the customer 
has done it. These may be viewed as different states in a life cycle progression, but 
the two underlying relationships cannot be considered identical. Thus, they could not 
be mapped to the same semantic interpretation. Finally, consider: 

Manager considers Agreement 
Manager negotiates Agreement. 

The structures of the relationships suggest that both relationships represent an interac- 
tion. However, “negotiates” implies changing the status, whereas “considers” in- 
volves simply viewing the status. On the other hand, 

Manager makes Agreement 
Manager writes Agreement 

may capture an identical notion of creation. These examples illustrate the importance 
of employing and understanding how a verb phrase captures the semantics of the 
application domain. The interpretation of verbs depends upon the nouns (entities) that 
surround them [Fellbaum, 1998]. 

Research has been carried out on defining and understanding ontology creation
and use. There are different definitions and interpretations of ontologies
[Weber 2002]. In general, though, ontologies deal with capturing,
representing, and using surrogates for the meanings of terms. This research
adopts the approach of Dahlgren [1988], who developed an ontology system as a
classification scheme for speech understanding and implemented it as an
interactive tool. Work on ontology development has been carried out in
database design (Embley et al. [1999], Kedad and Metais [1999], Dullea and
Song [1999], Bergholtz and Johannesson [2001]). These efforts provide useful
insights and build upon data abstractions. However, no comprehensive ontology
for classifying relationships has been proposed.

3 Ontology for Classifying Relationships 

This section proposes an ontology for classifying the verb phrases of relationships. 
The ontology is of the type developed by Dahlgren [1988] which operates as an inter- 
active system to classify things. The most important part is the classification scheme. 
It is the focus of this research and is divided into three layers (Figure 1). The layers 
were developed by considering: 1) prior research in data modeling, in particular, data 
abstractions and the inherent business life cycle; 2) the local context of the entities; 
and 3) the domain-dependent nature of verb phrases. 




Fig. 1. Relationship classification levels 

3.1 Fundamental Categories 

The fundamental categories are primitives that reflect a natural division in the real 
world. This category has three general classes that form the basis of how things in the 
real world can be associated with each other: status, change in status, and interaction 
as shown in Figure 2. 



Status: the orientation of one entity towards the other entity, e.g.
A <is-owner-of> B.

Change of status: change of one entity with respect to the other, e.g.
A <becomes-owner-of> B.

Interaction: communication or operation between entities that does not result
in a change of status, e.g. A <sends-message-to> B.

Fig. 2. Fundamental Categories

Status captures the fact that one thing has a status with respect to the
other. These are primitive because they describe a permanent, or durable,
association of one entity with another, expressing the fact that
A <is something> with respect to B. Business applications follow a natural
life cycle of conception or creation through to ownership and, eventually,
destruction. The change of status category describes this transition from one
status to another. Relationships in this category express the fact that A is
transitioning from A <is something> with respect to B to A <is something else>
with respect to B. An interaction does not necessarily lead to a change of
status of either entity. A change of status is recorded only when the effect
of an interaction is worth remembering. Consider the verb phrase 'create'. In
some cases, it is useful to remember this as a status <is-creator-of>, as in
Author writes Book. In other cases, the interaction itself is important, even
if it does not result in a change of status. The interaction category,
therefore, expresses the fact that A <is doing something> with respect to B.

These fundamental categories are sufficiently coarse that all verb phrases will map 
to them. They are also coarse enough to warrant finer categories to distinguish among 
the large set of relationships in each category. Thus, further refinement is needed for 
each fundamental category. 

3.1.1 Refining the Category: Status 

The ‘Status’ category has been extensively studied by research on data abstractions, 
which focuses on the structure of relationships as a surrogate for understanding their 
semantics. Most data abstractions associate entities at different levels of abstraction 
(sub/superclass relationships) [Goldstein and Storey, 1999]. Because data abstractions 
infer semantics from the structure of relationships, they provide a good starting 
point for understanding the semantics of relationships. Research on understanding 
natural language also provides verb phrase categories such as auxiliary, generic and 
other types. 

The first layer captures fundamental differences between kinds of relationships and
was built by considering prior, well-accepted research on data abstractions and other
frequently used verb phrases whose interpretation is unambiguous. These are independent
of context. This category, thus, captures the fundamental ways in which things in the
real world are related, so the categories in this level can be used to distinguish among
the fundamental types. Additional results from research on patterns [Coad, 1995] and
linguistic analysis [Miller, 1990] result in a hierarchical classification with defined
primitives at the leaves of the tree. Figure 3 shows this finer classification of the
category ‘Status.’

Examples of primitive status relationships are shown in Table 1. There are two 
variations of one thing being assigned to another: is-assigned-to and is-subjected-to. 
In A is-subjected-to B, A does not have a choice with respect to its relationship with 
B, whereas it might in the former. Temporal relationships capture the sequence of 
when things happen and can be clearly categorized as before, during, and after. 

3.1.2 Refining the Category: Change of Status 

The change-of-status primitives, in conjunction with the status primitives, capture the 
lifecycle transitions for each status. Although the idea of a lifecycle has been alluded 
Fig. 3. Primitives for the Category ‘Status’ 



Table 1. Primitives for ‘Status’ category

No.  Primitive                 Example                                   Source
1    A is-a B                  Pilot <is an> Employee                    [Brachman 1983]
2    A is-member-of B          Professor <is member of> Department       [Brodie 1981]
3    A is-part-of B            Car <has> Engine                          [Smith and Smith 1977]
4    A is-instance-of B        Video Tape <is copy of> Movie             [Motsching-Pitrik and Mylopoulous 1992]
5    A is-version-of B         Draft <is version of> Manuscript          [Motsching-Pitrik, 2000]
6    A is-descriptor-of B      Document <defines> Task                   [Larmon 1997, p. 156]
7    A is-creator-of B         Author <writes> Book                      [Gamma et al. 1995, p. 87]
8    A is-destroyer-of B       Tenant <terminates> Lease                 [Gamma et al. 1995, p. 266]
9    A is-owner-of B           Company <owns> Building                   [Larmon 1997, p. 157]
10   A is-in-control-of B      Manager <leads> Team                      [Larmon 1997, p. 156]
11   A is-assigned-to B        Employee <assigned to> Project            Added
12   A is-subjected-to B       Industry <regulated by> Law               Added
13   A follows-or-precedes B   Rental <follows> Reservation              [Hay 1996, Chp. 5]
14   A requires B              Construction <requires> Approval          Added
15   A is-next-to B            San Andreas Fault <is-next-to> Los Angeles  [Larmon 1997, p. 156; Hay 1996, p. 36]
16   A has-attitude-towards B  Customer <likes> Product                  Added



to previously [Hay 1996], prior research has not systematically recognized the lifecycle
concept. Our conceptualization of the ‘Change of Status’ category is based on an
extension and understanding of each primitive in the ‘Status’ category during the
business lifecycle. Consider verb phrases that deal with acquiring something, as is
typical of business transactions related to the status primitive ‘is-owner-of.’ The
lifecycle for this status primitive has the states shown in Figure 4.




Fig. 4. The Relationship Life Cycle 



Each state may, in turn, be mapped to different status primitives. For example, the 
lifecycle starts with needing something (‘has-attitude-towards’ and ‘requires’) which 
is followed by intending to become an owner (‘acquire’ or ‘create’), owning (‘owner’ 
or ‘in-control-of’) and giving up ownership (‘seller’ or ‘destroyer’). The primitives 
therefore illustrate a lifecycle that goes through creation or acquisition, ownership, 
and destruction. The life cycle can be logically divided into: intent, attempt to ac- 
quire, transition to acquiring, intent to give up, attempt to give up, and transition to 
giving up. Table 2 shows this additional information superimposed on the different 
states within the lifecycle. The sub-column under the change-of-status primitives 
shows the meanings captured in each: intent, attempt and the actual transition. 



Table 2. Primitives for the Category ‘Change of Status’

Primitive                         Meaning      Example
A wants-to-be B                   intent       Customer <wants to own> Product
A attempts-to-become owner of B   attempt      Customer <orders> Product
A becomes B                       transition   Customer <receives> Product
          Status Primitive: Customer <owns> Product
A dislikes-being B                intent       Company <wants to sell> Product
A attempts-to-give-up B           attempt      Company <offers> Product
A gives-up ownership-of B         transition   Company <sells> Product



3.1.3 Refining the Category: Interaction 

‘Interaction’ describes communication of short duration between two entities or an 
operation of one entity on another. The interaction may cause a change in one of the 
entities. For example, one entity may ‘manipulate’ another [Miller, 1990], or cause 
movement of the other through time or space (‘transmit,’ ‘receive’). Two entities may 
interact without causing change to either (‘communicate with,’ ‘observe’). One entity 
may also interact with another by way of performance (‘operate,’ ‘serve’). Figure 5 
shows the primitives for ‘Interaction,’ with examples given in Table 3. 




Fig. 5. Primitives for the Category ‘Interaction’ 



Table 3. Primitives for the Category ‘Interaction’

No.  Primitive     Example
1    View Status   Analyst <analyses> Requirements
2    Select        Customer <selects> Product
3    Communicate   Modem <negotiates with> Phone Line
4    Perform       Developer <tests> Software
5    Operate       Pilot <flies> Plane
6    Serve         Employee <serves> Customer
7    Manipulate    Instructor <grades> Exam
8    Transmit      Bank <remits> Payment
9    Receive       Warehouse <receives> Shipment



3.2 The Local (Internal) Context 

The second level captures internal context by taking into account the nature of the 
entities surrounding the verb phrase, highlighting the need to understand the nouns 
that surround verb phrases [Fellbaum, 1998]. For this research, entities are classified 
as: actor, action, and artifact. Actor entities are capable of performing independent 
actions. Action represents the performance of an act. Artifact represents an inanimate 
object not capable of independent action. After entities have been classified, valid 
primitives can be specified for each pair of entity types. For example, it does not 
make sense to allow the primitive ‘perform’ for two entities of the kind ‘Actor.’ On 
the other hand, this primitive is appropriate when one of the entities is classified as 
‘Actor’ and the other as ‘Action.’ The argument can be applied both to the ‘Status’ 
and ‘Interaction’ primitives. Because the ‘Change of Status’ primitives capture the 
lifecycle of ‘Status’ primitives, constraints identified for ‘Status’ primitives apply to 
the ‘Change of Status’ primitives as well. Table 4 shows these constraints for ‘Status’ 
primitives. Similar constraints have been developed for the ‘Interaction’ primitives. 
Table 4. Valid ‘Status’ Primitives based on Entity Context

Entity 1   Entity 2   Valid Status Primitives
Actor      Actor      control, creates/destructor, attitude, sequence, structure
Actor      Action     control, creates/destructor, attitude, not causing change
Actor      Artifact   control, creates/destructor, causing change, not causing
                      change, transfer, exchange
Action     Action     control, creates/destructor, attitude, sequence, not causing
                      change, 3E
Action     Artifact   control, creates/destructor, structure, causing change, not
                      causing change
Artifact   Artifact   control, creates/destructor, structure, causing change, not
                      causing change, transfer, exchange



3.3 Global (External) Context 

The third level captures the external context, that is, the domain in which the
relationship is used, reflecting the domain-dependent nature of verb phrases. Although
attempts have been made to capture taxonomies of such domain-dependent verbs, a
great deal of manual effort has been involved. This research takes a more pragmatic
approach in which a knowledge base of domain-dependent verb phrases is constructed
over time as the implemented ontology is used. When the user classifies a verb phrase,
its classification and application domain should be stored. Consider the use of ‘opens’
in a theatre database versus a bank database. The relationship Character <opens> Door
in the theatre domain maps to the interaction primitive <manipulates>. In the bank
application, Teller <opens> Account maps to the status primitive <is-creator-of>;
Customer <opens> Account maps to <is-owner-of>. If a verb phrase has already been
classified by a user, that classification can be suggested as a preliminary
classification to other users who want to classify it. If a verb phrase has already been
classified by a different user for the same application domain, then that classification
should be displayed to the user, who can either agree with it or provide a new
classification. New classifications will also be stored. Ideally, consensus will occur
over time. In this way the knowledge base builds up, ensuring that the verbs important
to different domains are captured appropriately. The following will be stored:

[Relationship, Verb phrase classification, Application Domain, User] 
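
The stored record and the suggestion behaviour described above can be sketched as follows. This is an illustrative sketch only: the class names, the `suggest` method, and the choice to list same-domain classifications before those from other domains are assumptions, not details given in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassificationRecord:
    relationship: tuple      # (entity 1, verb phrase, entity 2)
    classification: str      # primitive chosen by the user
    domain: str              # application domain
    user: str


class VerbPhraseKnowledgeBase:
    """Hypothetical in-memory store of classified verb phrases."""

    def __init__(self):
        self.records = []

    def store(self, relationship, classification, domain, user):
        self.records.append(
            ClassificationRecord(tuple(relationship), classification, domain, user)
        )

    def suggest(self, verb_phrase, domain):
        """Earlier classifications of verb_phrase, same domain first."""
        matches = [r for r in self.records if r.relationship[1] == verb_phrase]
        # dict.fromkeys keeps first-seen order while removing duplicates.
        same = dict.fromkeys(
            r.classification for r in matches if r.domain == domain
        )
        other = dict.fromkeys(
            r.classification
            for r in matches
            if r.domain != domain and r.classification not in same
        )
        return list(same) + list(other)


if __name__ == "__main__":
    kb = VerbPhraseKnowledgeBase()
    kb.store(("Character", "opens", "Door"), "manipulates", "theatre", "user1")
    kb.store(("Teller", "opens", "Account"), "is-creator-of", "bank", "user2")
    kb.store(("Customer", "opens", "Account"), "is-owner-of", "bank", "user2")
    print(kb.suggest("opens", "bank"))
```

Keeping every record, rather than overwriting earlier ones, leaves room for the consensus over time that the text anticipates.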

3.4 Use of the Ontology 

The ontology can be used for comparing relationships across two databases by first 
comparing the entities, followed by classification of the verb phrases accompanying 
the relationships. Examples are shown in Table 5. 

The ontology consists of a verb phrase classification scheme, a knowledge base 
that stores the classified verb phrases, organized by user and application, and a user- 
questioning scheme as mentioned above. The user is instructed to classify the entities 
of a relationship as actor, action, or artifact. The next step is to classify the verb 
Table 5. Relationship comparison considering classification of entities

Relationship 1              Relationship 2                  Relationship Comparison
Contractor builds Bridge    Builder constructs Tree house   Entities similar;
(Actor-Artifact)            (Actor-Artifact)                compare verb phrases
Contractor builds Bridge    Contractor has Employee         Entities differ; do not
(Actor-Artifact)            (Actor-Actor)                   compare verb phrases
Contractor builds Bridge    Worker does Raking              Relationships differ
(Actor-Artifact)            (Actor-Action)
Manager has Employee        Manager employs Worker          Entities similar;
(Actor-Actor)               (Actor-Actor)                   compare verb phrases
Manager gets Employee       Manager gets Contract           Entities differ; do not
(Actor-Actor)               (Actor-Artifact)                compare verb phrases
Employee finds Apprentice   Employee does Allocation        Entities differ; do not
(Actor-Actor)               (Actor-Action)                  compare verb phrases



phrase. First, the user is asked to select one of the three categories: ‘Status,’
‘Interaction,’ or ‘Change of Status.’ Based on this selection, and the constraints
provided by the entity types, primitives within each category are presented to the user
for an appropriate classification. Suppose a user classifies a relationship as ‘Status.’
Then, knowing the nature of the entities, only certain primitives are presented as
possible for the classification of the relationship. Furthermore, identifying that a
verb phrase is either status, change of status, or interaction restricts the subset of
categories from which an appropriate classification can be obtained and, hence, the
options presented to the user. If the verb phrase cannot be classified in this way, then
the other levels are checked to see if they are needed.
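
The comparison procedure of Table 5 (entity types first, verb phrases second) can be sketched as below. The `Relationship` class, the `compare` function, its return strings, and the primitives assigned to the sample verbs are all hypothetical and serve only to illustrate the order of the checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    entity1: str
    verb_phrase: str
    entity2: str
    entity1_type: str   # 'Actor', 'Action', or 'Artifact'
    entity2_type: str
    primitive: str = ""  # verb phrase classification, if already known


def compare(r1, r2):
    """Compare two relationships: entity types first, then primitives."""
    types1 = (r1.entity1_type, r1.entity2_type)
    types2 = (r2.entity1_type, r2.entity2_type)
    if types1 != types2:
        return "entities differ; do not compare verb phrases"
    if not r1.primitive or not r2.primitive:
        return "entities similar; classify verb phrases first"
    if r1.primitive == r2.primitive:
        return "entities similar; verb phrases match"
    return "entities similar; verb phrases differ"


if __name__ == "__main__":
    # The primitive 'is-creator-of' for both verbs is an assumed classification.
    a = Relationship("Contractor", "builds", "Bridge",
                     "Actor", "Artifact", "is-creator-of")
    b = Relationship("Builder", "constructs", "Tree house",
                     "Actor", "Artifact", "is-creator-of")
    c = Relationship("Contractor", "has", "Employee", "Actor", "Actor")
    print(compare(a, b))
    print(compare(a, c))
```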

4 Assessment 

Assessing an ontology is a difficult task. A plausible approach to assessment of an 
ontology is suggested by Gruninger and Fox [1995]. They suggest evaluating the 
‘competency’ of an ontology. One of the ways to determine this ‘competency’ is to 
identify a list of queries that a knowledge-base, which builds on the ontology, should 
be able to answer (competency queries). Based on these queries, the ontology may be 
evaluated by posing questions such as: Does the ontology contain enough information 
to answer these types of queries? Do the answers require a particular level of detail or 
representation of a particular area? Noy and McGuiness [2001] suggest that the competency
questions may be representative, but need not be exhaustive. Following our 
intent of classifying relationships for the purpose of comparison across databases, we 
attempted to assess whether the classification scheme of the ontology can provide a 
correct and complete classification of relationship verb phrases. To do so, a study was 
carried out which involved the following steps: 1) generation of the verb phrases to 
be classified; 2) generation of relationships using the verb phrases in different appli- 
cation domains; and 3) classification of all verb phrases. 
Step 1: Generation of Verb Phrases 

Only business-related verbs were used because the relationship ontology is intended
for use with business databases; this also restricts the scope of the research. Since
the SPEDE verbs [Cottam, 2000] were developed for business applications, these
automatically became part of the sample set. The researchers independently selected
business-related verbs from a set of 700 generated randomly from WordNet. The
verbs that were common to the selections made by both researchers were added to the
list from SPEDE. The same procedure was carried out on a set of 300 verbs that
were randomly selected by people who support the online dictionary
http://dictionary.cambridge.org/. This resulted in a total of 211 business verbs.

Step 2: Generation of Relationships Containing Verbs by Application Domain

For each verb, a definition was obtained from the on-line dictionary. Dictionaries
provide examples for understanding and context, which helped to generate the
relationships. Relationships were generated for seven application domains
(approximately 30 verbs in each): 1) education, 2) business management, 3) manufacturing,
4) airline, 5) service, 6) marketing, and 7) retail. Examples are shown in Table 6.



Table 6. Sample test relationships

Verb phrase  Source  Meaning(s)                               Domain         Generated Example
Import       SPEDE   to buy or bring in (products) from       Manufacturing  Manufacturers import Cars
                     another country
Obtain       SPEDE   to get (something), esp. by asking       Education      Students obtain Degrees
                     for it, buying it, working for it or
                     producing it from something else
Collect      SPEDE   to gather together from a variety        Airline        Agent collects Ticket
                     of places or over time
Accept       SPEDE   to agree to take (something), or to      Retail         Supermarkets accept Credit Cards
                     take (something) as satisfactory,
                     reasonable, true
Hire         SPEDE   to pay to use (something) for a          Service        Travelers hire Cars
                     short period or to pay (someone)
                     to do a job temporarily



After generating the relationships, the researchers independently classified them
using the relationship ontology. First, 30 verbs were classified and the researchers
agreed on 80% of the cases. The remaining verbs were then classified. The next step
involved assessing how many of the ontology classifications the set of 211 verbs
covered to test for completeness. The researchers generated additional relationships
for ten subclasses for a total of 225. Sample classifications are shown in Table 7.

The results of this exercise were encouraging, especially given our focus on evalu- 
ating the competency of the ontology [Gruninger and Fox 1995]. The classification 
scheme worked well for these sample relationships. It allowed for the classification of 
all verb phrases. The biggest difficulty was in identifying whether to move from one 
level to the next. For example, Student acquires Textbook is immediately classifiable 
Table 7. Sample classifications of relationships

Entity 1                Verb Phrase   Entity 2              Classification
Manufacturer            imports       Part                  receives
Student                 acquires      Textbook              becomes-owner-of
Air traffic controller  establishes   Flight path           becomes-creator-of
Salesperson             converts      Competitor-customer   manipulates
Customer                enters-into   Sales agreement       becomes-subjected-to
Caterer                 delivers-to   Plane                 is-assigned-to
Teacher                 distributes   Handout               sends
Airline                 adjusts       Schedule              manipulates



by the primitives. In other cases, the next layer was necessary. Further research is 
needed to design a user interface that can explain the use and categories to the user so 
they can be effectively applied. A preliminary version of a prototype has been devel- 
oped. This will be completed and an empirical test carried out with typical end-users, 
most likely, database designers. 

5 Conclusion 

A classification scheme for comparing relationship verb phrases has been presented. 
It is based upon results obtained from research on conceptual modeling, common 
sense knowledge of a typical life cycle, and the domain-dependent nature of relationships.
Further research is needed to complete the ontology system of which the classification
scheme will be a part. The system then needs to be expanded to allow for multiple
classifications, and the user interface needs to be refined.
Abstract. Mining maximal frequent itemsets (MFI) is a fundamental problem in many
data mining applications. Since the MaxMiner algorithm introduced enumeration trees
for MFI mining in 1998, several methods have been proposed that use depth-first
search to improve performance. This paper presents FIMfi, a new depth-first algorithm
based on the FP-tree and the MFI-tree for mining MFI. FIMfi adopts a novel item
ordering policy for efficient lookaheads pruning and a simple method for fast superset
checking. It uses a variety of old and new pruning techniques to prune the search
space. Experimental comparison with previous work reveals that FIMfi greatly reduces
the number of FP-trees created and outperforms similar algorithms by more than 40%
on average.



1 Introduction 

Since the frequent itemsets mining (FIM) problem was first addressed [1], mining
frequent itemsets in large databases has been an important problem because it enables
essential data mining tasks such as discovering association rules, data correlations,
sequential patterns, etc.

There are two types of algorithms for mining frequent itemsets. The first is the
candidate set generate-and-test approach [1]. The basic idea is to generate and test
candidate itemsets. Each candidate itemset with k+1 items is generated only from
frequent itemsets with k items. This process is repeated in a bottom-up fashion until no
candidate itemset can be generated. At each level, the frequencies of all candidate
itemsets are tested by scanning the database. This method therefore requires scanning
the database several times; in the worst case, the number of scans equals the
maximal length of the frequent itemsets. In addition, many candidate itemsets are
generated, most of which turn out to be infrequent. The second is the data transformation
approach [2, 4], which avoids the cost of generating and testing a large number of
candidate sets by growing a frequent itemset from its prefix. It constructs a sub-database
related to each frequent itemset h such that all frequent itemsets that have h as a
prefix can be mined using only that sub-database.
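
The generate-and-test idea can be illustrated with a short level-wise sketch. It is a simplified illustration written for this discussion, not the algorithm of [1] verbatim: the function name `apriori` and the toy database are assumptions, and an itemset is treated as frequent when its support is at least min_sup.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: support} for all frequent itemsets, level by level."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items with one scan of the database.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    result = dict(frequent)

    k = 1
    while frequent:
        # Generate (k+1)-candidates only from frequent k-itemsets and keep
        # those whose k-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in frequent for sub in combinations(union, k)
            ):
                candidates.add(union)

        # Test the candidates with one further scan of the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_sup}
        result.update(frequent)
        k += 1
    return result


if __name__ == "__main__":
    db = ["abc", "ab", "ac", "bc", "abcd"]
    for itemset, sup in sorted(apriori(db, 3).items(), key=lambda p: sorted(p[0])):
        print(sorted(itemset), sup)
```

Each pass of the `while` loop corresponds to one additional database scan, which is the cost the data transformation approach tries to avoid.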
The number of frequent itemsets increases exponentially with the length of the frequent
itemsets, so long frequent itemsets make FI mining infeasible. However, since every
subset of a frequent itemset is itself frequent, it is sufficient to discover only the
maximal frequent itemsets. As a result, researchers have turned to finding MFI
(maximal frequent itemsets) [5,6,9,10,4,7]. A frequent itemset is called maximal
if it has no frequent superset. Given a set of MFI, it is easy to analyze some
interesting properties of the database, such as the longest pattern, the overlap of the
MFI, etc. There are also applications where the MFI are adequate, for example,
combinatorial pattern discovery in biological applications [3].

This paper focuses on the MFI mining problem based on the data transformation
approach. We use an FP-tree to represent a sub-database containing all relevant frequency
information, and an MFI-tree to store information about discovered MFI that is
useful for superset frequency pruning. With these two data structures, our algorithm
adopts a novel item ordering policy and integrates a variety of old and new pruning
strategies. It also uses a simple but fast superset checking method along with some
other optimizations.

The remainder of this paper is organized as follows. In section 2, we briefly review
the MFI mining problem and introduce the related work. Section 3 gives the MFI
mining algorithm, FIMfi, which mines MFI based on the FP-tree and the MFI-tree.
In this section we also introduce our novel item ordering policy, the pruning strategies
we apply, and the simple but fast superset checking that is needed for efficient
“lookaheads” pruning. In section 4, we compare our algorithm with some previous work.
Finally, section 5 gives the conclusions.



2 Preliminaries and Related Work 

This section formally describes the MFI mining problem and the set enumeration
tree that represents the search space. It also introduces the related work and two
important data structures used in our scheme, the FP-tree and the MFI-tree.

2.1 Problem Revisit 

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of m distinct items. Let D denote a database
of transactions, where each transaction contains a set of items. A set $X \subseteq I$ is
also called an itemset. An itemset with k items is called a k-itemset. The support of an
itemset X, denoted as sup(X), is the number of transactions in which X occurs as a
subset. For a given D and the threshold min_sup, itemset X is frequent if
$\mathit{sup}(X) \ge \mathit{min\_sup}$. If $\mathit{sup}(X) \ge \mathit{min\_sup}$ and for
any $Y \supset X$ we have $\mathit{sup}(Y) < \mathit{min\_sup}$, then X is called a
maximal frequent itemset. From the definitions we have the following two lemmas:

Lemma 1: A proper subset of any frequent itemset is not a maximal frequent itemset.

Lemma 2: A subset of any frequent itemset is a frequent itemset; a superset of any
infrequent itemset is not a frequent itemset.
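
The definitions can be made concrete with a brute-force sketch that enumerates every itemset. It is only meant to illustrate support, frequency, and maximality on a toy database; the function names are introduced here, an itemset is treated as frequent when its support is at least min_sup, and exhaustive enumeration is of course not how practical MFI miners work.

```python
from itertools import combinations


def support(itemset, database):
    """Number of transactions that contain itemset as a subset."""
    return sum(1 for t in database if itemset <= t)


def frequent_itemsets(database, min_sup):
    """All non-empty itemsets whose support is at least min_sup."""
    items = sorted(set().union(*database))
    found = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            if support(x, database) >= min_sup:
                found.append(x)
    return found


def maximal_frequent_itemsets(database, min_sup):
    """Frequent itemsets that have no frequent proper superset."""
    freq = frequent_itemsets(database, min_sup)
    return [x for x in freq if not any(x < y for y in freq)]


if __name__ == "__main__":
    db = [frozenset(t) for t in ["abc", "ab", "ac", "bc", "abcd"]]
    for mfi in maximal_frequent_itemsets(db, 3):
        print(sorted(mfi))
```

In this toy database every frequent itemset is a subset of some maximal one, which is why reporting the MFI alone is sufficient to characterize the frequent itemsets (without their individual supports).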
Given a transactional database D, suppose I is the set of its items; then any
combination of the items in I could be frequent, and all these combinations compose the
search space, which can be represented by a set enumeration tree [5]. For example,
suppose I = {a,b,c,d,e,f} is sorted in a fixed lexicographic order; then the search tree
is as shown in Figure 1. To keep the tree from growing too big, we use the subset
infrequency pruning and superset frequency pruning techniques, which are introduced in
the next section. The root of the tree represents the empty itemset, and the nodes at
level k contain all of the k-itemsets. The itemset associated with each node n will be
referred to as the node's head(n). The possible extensions of the itemset are denoted as
con_tail(n), which is the set of items after the last item of head(n). The frequent
extensions, denoted as fre_tail(n), are the set of items that can be appended to head(n)
to build longer frequent itemsets. In a depth-first traversal of the tree, fre_tail(n)
contains only the frequent extensions of n. The itemset associated with each child node
of node n is built by appending one item of fre_tail(n) to head(n). As an example in
Figure 1, suppose node n is associated with {b}; then head(n) = {b} and
con_tail(n) = {c,d,e,f}. Because {b,f} is not frequent, fre_tail(n) = {c,d,e}.
The child node of n, {b,e}, is built by appending e from fre_tail(n) to {b}.
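
A small sketch may help to fix the notions of head, con_tail, and fre_tail. The toy database below is invented so that, as in the example for node {b}, the extension by f is infrequent; the function names mirror the notation of the text but the code itself is only an illustration, with an itemset treated as frequent when its support is at least min_sup.

```python
def support(itemset, database):
    """Number of transactions that contain itemset as a subset."""
    return sum(1 for t in database if itemset <= t)


def con_tail(head, order):
    """Items that come after the last item of head in the fixed order."""
    if not head:
        return list(order)
    last = max(order.index(i) for i in head)
    return order[last + 1:]


def fre_tail(head, order, database, min_sup):
    """Items of con_tail whose addition to head stays frequent."""
    return [
        x for x in con_tail(head, order)
        if support(frozenset(head) | {x}, database) >= min_sup
    ]


def children(head, order, database, min_sup):
    """Itemsets of the child nodes of the node whose head is given."""
    return [
        frozenset(head) | {x}
        for x in fre_tail(head, order, database, min_sup)
    ]


if __name__ == "__main__":
    order = list("abcdef")
    # Hypothetical database in which {b, f} occurs only once.
    db = [frozenset(t) for t in ["bcde", "bcde", "bc", "af", "abf"]]
    print(con_tail({"b"}, order))
    print(fre_tail({"b"}, order, db, 2))
```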



The problem of MFI mining can be thought of as finding a border in the tree such that
all the elements above the border are frequent itemsets and the others are not. All MFIs
lie near the border. In the example of Figure 1, the itemsets in ellipses are MFIs.

2.2 Related Work 

Given the set enumeration tree, we can describe the most recent approaches to the MFI
mining problem.

MaxMiner [5] employs a breadth-first traversal policy for the search. To reduce the
search space according to Lemma 1, it performs not only subset infrequency pruning,
which skips over itemsets that have an infrequent subset, but also superset frequency
pruning (also called lookaheads pruning). To increase the effectiveness of superset
frequency pruning, MaxMiner dynamically reorders the children nodes, a technique used
in all the MFI algorithms after it [4,6,7,9,10]. Normally, depth-first approaches
perform better on lookaheads, but MaxMiner uses a breadth-first approach instead to
limit the number of passes over the database.




Fig. 1. Search space tree 
DepthProject performs a mixed depth-first traversal and applies subset infrequency
pruning and a variation of superset frequency pruning [6] to the tree. It also uses an
improved counting method based on transaction projections along its branches. The
original database and the projections are represented as bitmaps. The experimental
results in [6] show that DepthProject outperforms MaxMiner by more than one time.

Mafia [7] is another depth-first algorithm. It also uses a vertical bitmap
representation, where the count of an itemset is based on the column in the bitmap.
Besides the two pruning methods mentioned above, Mafia uses another novel pruning
technique called PEP (Parent Equivalence Pruning) [8]. The experiments in
[7] show that PEP prunes the search space greatly.

Both DepthProject and Mafia mine a superset of the MFIs and require post-pruning
to eliminate non-maximal frequent itemsets. GenMax [9] integrates the pruning
with the mining and finds the exact MFIs by using two strategies. First, just as the
transaction database is projected onto the current node, the discovered MFI set can also
be projected onto the node (Local MFI), which yields fast superset checking. Second,
GenMax uses Diffset propagation for fast frequency computation.

AFOPT [3] uses a data structure called the AFOPT tree, in which items are ordered by
ascending frequency, to store the transactions of the original database. It also uses
subset infrequency pruning, superset frequency pruning and PEP pruning to reduce the
search space. In addition, it employs the LMFI generated by a pseudo projection technique
to test whether a frequent itemset is a subset of one of them.

FPMax* is an extension of the FP-growth method for mining MFIs only. It uses an
FP-tree to store the transaction projection of the original database for each node in
the tree. To test whether a frequent itemset is a subset of any discovered MFI in
lookaheads pruning, another tree structure (the MFI-tree) is used to keep track of
all discovered MFI, which makes superset checking effective. FPMax* uses an array for
each node to store the counts of all 2-itemsets that are subsets of the frequent
extensions itemset, so that the algorithm scans each FP-tree only once for each
recursive call emanating from it. The experimental results in [10] show that FPMax* has
the best performance for almost all the tested databases.

2.3 FP-Tree and MFI-Tree 

The FP-growth method [2] builds a data structure called the FP-tree (Frequent Pattern
tree) for each node of the search space tree. The FP-tree is a compact representation of
all relevant frequency information of the current node. Each of its paths from the root
to a node represents an itemset, and the nodes along the paths are stored according to
the order of the items in fre_tail(n). Each node of the FP-tree also stores the number of
transactions or conditional pattern bases that contain the itemset represented by
the path. Compression is achieved by building the tree in such a way that overlapping
itemsets share prefixes of the corresponding branches.

Each FP-tree is associated with a header table. The single items in the tail, together
with the support of the itemset that is the union of the head and the item, are stored in
the header table in decreasing order of support. The entry for an item also contains the
head of a list that links all the corresponding nodes of the FP-tree.
To construct the FP-tree of node n, the FP-growth method first finds all the frequent
items in fre_tail(n) by an initial scan of the database or of head(n)'s conditional
pattern base, which comes from the FP-tree of its parent node. These items are then
inserted into the header table in the order of the items in fre_tail(n). In the next
and last scan, each frequent itemset that is a subset of the tail is inserted into the
FP-tree as a branch. If a new itemset shares a prefix with an itemset that is already
in the tree, the new itemset shares the branch that represents the common prefix
with the existing itemset. For example, for the database and min_sup shown in
Figure 2(a), the FP-trees of the root and of itemset {f} are shown in Figure 2(b) and (c).
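
The two-scan construction described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used by FP-growth or FIMfi: the names FPNode and build_fp_tree are hypothetical, the database is assumed to fit in memory as a list of transactions, and the header table is simplified to a dictionary that maps each item to its list of node links.

```python
from collections import Counter


class FPNode:
    """One FP-tree node: an item, a count, and links to parent and children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_sup):
    """Build an FP-tree in two scans and return (root, header, frequent)."""
    # Scan 1: support of every single item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: s for i, s in support.items() if s >= min_sup}
    # Header order: decreasing support, ties broken by item name.
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    root = FPNode(None)
    header = {item: [] for item in order}  # item -> list of node links
    # Scan 2: insert each transaction as a branch, sharing common prefixes.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, frequent


# Hypothetical database (not the one from Figure 2).
db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["d"]]
root, header, frequent = build_fp_tree(db, min_sup=2)
```

In this small example the transactions {a,b,c} and {a,b} share the prefix a, b, so the tree stores that prefix only once and increments the counts along it.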




Fig. 2. Examples of FP-tree 



FPMax* uses an array for each node along with the FP-tree to avoid the first scan
of the conditional pattern bases. For each 2-itemset {a,b} in the frequent extensions
itemset, an array entry stores the support of head(n) ∪ {a,b}. When extending the
tree from a node to one of its children, we can then build the header of the child's
FP-tree from the array and avoid scanning the FP-tree of the current node again.

Consider a given MFI M at node n in depth-first MFI mining. If
head(n) ∪ fre_tail(n) ⊆ M, then none of the children of n need to be considered,
according to lemma 1. This is superset frequency pruning, also called lookaheads in [5].
Lookaheads needs access to the information in the discovered MFI that is relevant to
the current node. FPMax* uses another FP-tree (the MFI-tree) to meet this need. The
differences between the MFI-tree and the FP-tree of the same node are as follows.
First, the nodes do not record frequency information; instead they store the length of
the itemset represented by the path from the root to the current node. Second, for each
itemset S represented by a path, head(n) ∪ S is a subset of some discovered MFI. In
addition, when considering an offspring node of a node, the MFI-tree of the node is
updated as soon as a new MFI is found. Figure 3 shows several examples of MFI-trees.



3 Mining Maximal Frequent Itemsets by FIMfi 

In this section, we discuss our algorithm FIMfi in detail and explain why it is faster
than some previous schemes.

3.1 Pruning Techniques

Subset Infrequency Pruning: Suppose n is a node in the search space tree. For
each item x in con_tail(n) that may become an item in fre_tail(n), we need to
compute the support of the itemset head(n) ∪ {x}. If sup(head(n) ∪ {x}) < min_sup,
then we do not add x to fre_tail(n), and the node identified by itemset head(n) ∪ {x}
is not considered any further. This is based on lemma 2: all itemsets that are
supersets of head(n) ∪ {x} are not frequent.
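
Subset infrequency pruning can be illustrated with a brute-force sketch. The helper names support and frequent_tail are hypothetical, and support is counted here by scanning a list of transactions directly, whereas FIMfi would obtain these counts from the FP-tree or the array.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def frequent_tail(head, con_tail, transactions, min_sup):
    """Keep only the items x of con_tail with sup(head union {x}) >= min_sup."""
    head = set(head)
    return [x for x in con_tail
            if support(head | {x}, transactions) >= min_sup]


# Hypothetical database, chosen only for illustration.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
fre_tail = frequent_tail({"a"}, ["b", "c", "d"], db, min_sup=2)
```

Here item d is dropped because {a,d} occurs in no transaction, so the node identified by {a,d} and all of its supersets are never visited.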

Superset Frequency Pruning: Superset frequency pruning is also called lookaheads
pruning. Considering a node n, if the itemset head(n) ∪ fre_tail(n) is frequent, then
all the child nodes of n should be pruned (lemma 1).

There are two existing methods for determining whether the itemset head(n) ∪
fre_tail(n) is frequent. The first is to count the support of head(n) ∪ fre_tail(n)
directly; this method is normally used in breadth-first algorithms such as MaxMiner.
The second is to check whether a superset of head(n) ∪ fre_tail(n) is already among
the discovered MFIs. It is commonly used by the depth-first MFI algorithms
[4,7,9,10]. There are also other techniques, such as LMFI and MFI projection,
that are used to reduce the cost of checking. For example, with the MFI-tree we
can just check whether a superset of fre_tail(n) can be found among the conditional
pattern bases of head(n), which completes the superset checking. Here we propose a
new way to do lookaheads pruning based on the FP-tree. For a given node, we can get all
the conditional pattern bases of head(n) from the FP-tree of its parent node. Our
algorithm then tries to find a superset of fre_tail(n) among those conditional pattern
bases whose last item's count is no less than the minimum support. If we find
one, S, then we know that head(n) ∪ S is frequent, so head(n) ∪ fre_tail(n) is
frequent based on lemma 2. For example, when considering itemset {b}, the fre_tail of
{b} is {a,c}, and there is a conditional pattern base of {b}, namely a3,c3 (Figure 2(b)),
so we know that {b,c,a} is frequent and all the children of {b} are pruned. If FIMfi
finds a superset of fre_tail(n) in the FP-tree and head(n) ∪ fre_tail(n) is an
undiscovered MFI, FIMfi needs to update the MFI-trees with head(n) ∪ fre_tail(n) as
described before.

In addition, we also do superset frequency pruning with the itemset head(n) ∪
con_tail(n). Before generating fre_tail(n) from con_tail(n), our algorithm checks
whether there is a superset of con_tail(n) in the FP-tree. This check is worthwhile
because our scheme uses a very simple and fast method for superset checking (see
section 3.2).
Parent Equivalence Pruning: FIMfi also uses PEP for efficiency. Take
any item x from fre_tail(n) such that sup(head(n) ∪ {x}) = sup(head(n)), and
write Y = head(n). Then any frequent itemset Z that contains Y but does not contain x
has the frequent superset Z ∪ {x}. Since we only want the MFI, it is not necessary
to count itemsets that contain Y but do not contain x. Therefore, we can move item
x from fre_tail(n) to head(n). The experimental results show that PEP greatly
reduces the number of FP-trees compared with FPMax*.
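
The PEP step can be sketched as follows. This is a simplified illustration, not the FIMfi code: the function names are hypothetical, and supports are recomputed by scanning a list of transactions, whereas the algorithm reads them from the header table of the FP-tree.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


def apply_pep(head, fre_tail, transactions):
    """Move every tail item x with sup(head union {x}) == sup(head) into the head."""
    head = set(head)
    head_sup = support(head, transactions)
    peps = [x for x in fre_tail
            if support(head | {x}, transactions) == head_sup]
    new_head = head | set(peps)
    new_tail = [x for x in fre_tail if x not in peps]
    return new_head, new_tail


# Hypothetical database: every transaction containing a also contains b.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c"}]
new_head, new_tail = apply_pep({"a"}, ["b", "c"], db)
```

In this example b moves into the head because it occurs in every transaction that contains a, so only c remains as a candidate extension.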

3.2 Superset Checking 

As discussed before, superset checking is a main operation in lookaheads pruning,
because each new MFI needs to be checked before being added to the
MFIs. MaxMiner needs to scan all the discovered MFIs and tries to map head(n) ∪
fre_tail(n) item by item onto each discovered MFI. Although GenMax uses LMFI to
store all the relevant MFIs, it also needs to map item by item. FPMax*
only needs to map fre_tail(n) item by item onto the conditional pattern bases of
head(n) in the MFI-tree. Our simple but fast superset checking method for head(n) ∪
fre_tail(n) is based on the following lemmas:

Lemma 3: If there is a conditional pattern base of head(n) in the MFI-tree whose
length is equal to the length of fre_tail(n), then head(n) ∪ fre_tail(n) is frequent.

Proof: Let S be the itemset represented by the base; then head(n) ∪ S is frequent.
For each item x in S, sup(head(n) ∪ {x}) ≥ min_sup, so x ∈ fre_tail(n) and hence
S ⊆ fre_tail(n). Since the two have the same length, S = fre_tail(n). Hence, we
obtain the lemma.

Lemma 4: If there is a conditional pattern base of head(n) in the MFI-tree whose
length is equal to the length of con_tail(n), then head(n) ∪ con_tail(n) is frequent.

Proof: Let S be the itemset represented by the base; then head(n) ∪ S is frequent.
Since con_tail(n) includes all possible extensions of head(n), S ⊆ con_tail(n).
Since the two have the same length, S = con_tail(n). Hence, we obtain the lemma.

Lemma 5: Suppose y is a conditional pattern base of head(n) in the FP-tree. If the
counter associated with the last item of y is no less than min_sup, and the length of y
is equal to the length of fre_tail(n), then head(n) ∪ fre_tail(n) is frequent.

Proof: Similar to Lemma 3.

Lemma 6: Suppose y is a conditional pattern base of head(n) in the FP-tree. If the
counter associated with the last item of y is no less than min_sup, and the length of y
is equal to the length of con_tail(n), then head(n) ∪ con_tail(n) is frequent.

Proof: Similar to Lemma 4.

According to lemma 3 and lemma 4, superset checking does not need to map item
by item; it can be done just by checking the length of itemsets. Here the level of the
last item in the base can be used as the length of the base. To make the length check
more efficient, the only change FIMfi makes to the MFI-tree is to store the node links
of items in the header table in decreasing order of the bases' level. The superset
checking is then very simple, because it only needs to compare the lengths of two
itemsets.

Similarly, superset checking based on the FP-tree can also be simple according to
lemma 5 and lemma 6. In this situation, we add a level to each node of the FP-tree,
with the level representing the length of the path from the node in question to the
root. The node links whose counts are no less than min_sup are stored in decreasing
order of the levels. An example is shown in Figure 2(d). This superset
checking is therefore also simple, because it only needs to compare the lengths of two
itemsets. Let us revisit the example in section 3.1: when doing the superset checking
of {b} ∪ {a,c}, we need only compare the length of the conditional pattern base a3:c3
with the length of itemset {a,c}.
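
The length-based test of lemma 5 can be sketched as follows. The representation is an assumption made for illustration: each conditional pattern base is a list of (item, count) pairs ordered from the root downwards, the function name is hypothetical, and min_sup = 2 is an assumed value, since the threshold of Figure 2 is not restated here. The sketch relies on the precondition from the proof of lemma 3, namely that every item of a qualifying base already belongs to the tail.

```python
def superset_check_by_length(bases, tail, min_sup):
    """Return True if some qualifying base is as long as the tail.

    A base qualifies when the count of its last item is at least min_sup.
    Only lengths are compared; no item-by-item mapping is performed.
    """
    target = len(tail)
    for base in bases:
        if not base:
            continue
        last_count = base[-1][1]
        if last_count >= min_sup and len(base) == target:
            return True
    return False


# The example from the text: head {b}, tail {a,c}, base a3:c3.
bases_of_b = [[("a", 3), ("c", 3)]]
found = superset_check_by_length(bases_of_b, ["a", "c"], min_sup=2)
```

Because the node links are kept in decreasing order of level, an implementation along these lines would only need to inspect the first qualifying link rather than loop over all bases.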



3.3 Item Ordering Policy 

The item ordering policy first appears in [5], and is used by almost all subsequent MFI
algorithms because it can increase the effectiveness of superset frequency pruning. As
we know, items with higher frequency are more likely to be members of long frequent
itemsets and subsets of some discovered MFIs. For node n, after fre_tail(n) is
generated and before extending to the children, the traditional scheme sorts the items
in the tail in decreasing order of sup(head(n) ∪ {a}) (a ∈ fre_tail(n)). This makes
the most frequent items appear in more itemsets that are frequent extensions of some
of n's offspring nodes. Therefore, more such offspring nodes are pruned. In
general, this type of item ordering policy works better for lookaheads that scan the
database to count the support of head(n) ∪ fre_tail(n) in breadth-first algorithms, such
as MaxMiner. All the recently proposed depth-first algorithms use superset
checking instead to implement lookaheads pruning, because counting the support of
head(n) ∪ fre_tail(n) is costly under a depth-first policy.

Since the superset checking of FIMfi is based on the MFI-tree and/or the FP-tree, we
try to find an item ordering policy that makes use of the information in these trees.
As we know, if S is a subset of the tail and head(n) ∪ S is frequent, then we can
prune the nodes identified by the itemsets head(n) ∪ s (s ⊆ S), because the itemsets
corresponding to these nodes and their offspring are not maximal (lemma 1). If, based
on the FP-tree and MFI-tree, a policy can let S be the maximal such subset of
fre_tail(n), we achieve maximal pruning at the node in question.

Suppose there are two itemsets S1 and S2. S1 is represented by the conditional
pattern base whose length is maximal in the MFI-tree. S2 is represented by the
conditional pattern base whose length is maximal among those conditional pattern bases
in the FP-tree whose last item's count is no less than min_sup. Let S be
the longer of S1 and S2. If we put the items in S at the head of fre_tail(n), then
we attain the maximal pruning. For example, when considering the node n identified
by {e}, we know fre_tail(n) = {a,c,b}, S1 = ∅ and S2 = {a,c} as in Figure 2(b), so
the sorted items in fre_tail(n) are in the sequence a,c,b, whereas the old decreasing
order of supports is b,a,c. FPMax*, using the old decreasing order policy, has to build
FP-trees for nodes {e}, {e,a}, and {e,c}, but FIMfi with the new ordering policy only
needs to build FP-trees for nodes {e} and {e,a}. Similarly, when considering the node
{d}, we know fre_tail(n) = {a,c,b}, S1 = {a,c} as in Figure 3(b) and S2 = ∅, so the
sorted items in fre_tail(n) are in the sequence a,c,b, whereas the old
decreasing order of supports is b,a,c. The experimental results for these two policies
are given in section 4 for comparison.

Furthermore, we also sort the items in fre_tail(n) - S in
decreasing order of sup(head(n) ∪ {x}) (x ∈ fre_tail(n) - S).
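
The ordering policy of this section can be sketched as follows. The function name and the support values are hypothetical; the supports are chosen only so that the old decreasing order is b,a,c as in the example for node {e}. The order of the items inside S is taken from the base, which is an assumption, since the text only requires that S come first.

```python
def order_tail(fre_tail, s1, s2, support_of):
    """Put the longer of S1 and S2 first, then the rest by decreasing support."""
    s = s1 if len(s1) >= len(s2) else s2
    first = [x for x in s if x in fre_tail]
    rest = sorted((x for x in fre_tail if x not in s),
                  key=lambda x: -support_of[x])
    return first + rest


# Node {e}: fre_tail = {a,c,b}, S1 is empty, S2 = {a,c}.
supports = {"b": 4, "a": 3, "c": 3}   # assumed values, old order is b,a,c
new_order = order_tail(["b", "a", "c"], [], ["a", "c"], supports)
```

With this order the first child built is {e,a}, whose tail {c,b} still contains the rest of S, which is what allows the remaining children covered by S to be pruned.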

3.4 Optimizations 

FIMfi uses the same array technique for counting frequency as FPMax*, but FIMfi does
not count the whole triangular array as FPMax* does. Suppose that at node n the sorted
fre_tail(n) is i_1, i_2, ..., i_l, ..., i_m, such that head(n) ∪ {i_1, i_2, ..., i_l} is
frequent. When we extend the nodes corresponding to the items that precede i_{l+1}, the
superset checking returns true and those nodes are pruned. So the cells corresponding
to the 2-itemsets that are subsets of {i_1, i_2, ..., i_l} are never used. Therefore,
FIMfi does not count those cells when building the array. In this way, FIMfi spends
less than FPMax* on counting the array, and the bigger l is, the more counting time
is saved.
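
A minimal sketch of this restricted counting is given below. It makes simplifying assumptions: the array is represented as a dictionary keyed by item pairs, the input is a plain list of transactions rather than weighted conditional pattern bases, and the name count_pair_cells is hypothetical.

```python
from itertools import combinations


def count_pair_cells(transactions, tail, l):
    """Count 2-itemset supports over `tail`, skipping pairs inside tail[:l]."""
    pos = {item: k for k, item in enumerate(tail)}
    counts = {}
    for t in transactions:
        # Items of the transaction that are in the tail, in tail order.
        present = sorted((i for i in set(t) if i in pos), key=pos.get)
        for a, b in combinations(present, 2):
            # b comes after a, so both lie in the first l positions
            # exactly when pos[b] < l; such cells are never used.
            if pos[b] < l:
                continue
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts


# Hypothetical data: sorted tail a,c,b where head union {a,c} is frequent (l = 2).
db = [["a", "c", "b"], ["a", "c"], ["c", "b"]]
cells = count_pair_cells(db, ["a", "c", "b"], l=2)
```

Setting l = 0 in this sketch reproduces the full triangular array, which makes the saving easy to compare on a given dataset.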

We also use the memory management 
described in [10] to reduce the time 
consumed in allocating and deallocating 
space for FP-trees and MFI-trees. 



procedure: FIMfi Algorithm
Input:
  n: a node in the search space tree that is associated with a head
     itemset h, an FP-tree, an MFI-tree, and an array.
  M-trees: MFI-trees of all ancestor nodes of h

(1)  for each item x from end to beginning in header of n.FP-tree
(2)    h' = h ∪ {x}    // h' identifies n'
(3)    if x is not the end item of the header
(4)      if (superset_checking(con_tail(n'), n.MFI-tree)) return
(5)      if (superset_checking(con_tail(n'), n.FP-tree))
(6)        insert h' ∪ con_tail(n') into M-trees; return
(7)    if n.array is not null
(8)      fre_tail(n') = {frequent items for x in n.array}
(9)    else
(10)     fre_tail(n') = {frequent items in conditional pattern base of h'}
(11)   PEPs = {items whose count equals the support of h'}
(12)   if (superset_checking(fre_tail(n'), n.MFI-tree))
(13)     if the number of items before x in the header is |fre_tail(n')| return
(14)     else continue
(15)   if (superset_checking(fre_tail(n'), n.FP-tree))
(16)     insert h' ∪ fre_tail(n') into M-trees
(17)     if the number of items before x in the header is |fre_tail(n')| return
(18)     insert fre_tail(n') into n.MFI-tree; continue
(19)   h' = h' ∪ PEPs, fre_tail(n') = fre_tail(n') - PEPs
(20)   sort the items in fre_tail(n')
(21)   construct the FP-tree of n'
(22)   if (superset_checking(fre_tail(n'), n'.FP-tree))
(23)     insert h' ∪ fre_tail(n') into M-trees
(24)     if the number of items before x in the header is |fre_tail(n')| return
(25)     insert fre_tail(n') into n.MFI-tree; continue
(26)   construct the MFI-tree of n'
(27)   M-trees = M-trees ∪ {n.MFI-tree}
(28)   call FIMfi(n', M-trees)

Fig. 4. Pseudo-code of Algorithm FIMfi



3.5 FIMfi 

Based on sections 3.1-3.4, we show the pseudo-code of FIMfi in Figure 4. In each
call of the procedure, a newly found MFI may be used in superset checking for ancestor
nodes of the current node, so we use a parameter called M-trees to access the MFI-trees
of the ancestor nodes. When the top-level call (FIMfi(root, ∅)) is over, all the MFIs
to be mined are stored in the MFI-tree of the root of the search space tree.

From line (4) to line (6), FIMfi does superset frequency pruning for the itemset
con_tail(n'). When x is the end item of the header, there is no need to do the pruning,
because it has already been done by the calling procedure in line
(12) and/or line (22). Lines (7) to (10) use the array optimization technique. The
PEP technique is used in line (11) and line (19). The superset frequency pruning for
the itemset fre_tail(n') is done in lines (12) to (18); when the condition at line (17)
is true, all the child nodes of n are pruned and fre_tail(n') need not be inserted
into n.MFI-tree any more. Line (20) uses our novel item ordering policy. Line (21)
builds a new FP-tree: n'.FP-tree. Lines (22) to (25) do another superset frequency
pruning for fre_tail(n') in that tree. The return statements in lines (4), (6), (13),
(17) and (24) mean that all the child nodes of n after n' are pruned there. The
continue statements in lines (14), (18) and (25) mean that node n' is pruned, so
we can go on to consider the next child of n. After n'.FP-tree and
n'.MFI-tree are constructed and M-trees is updated, FIMfi is called recursively with
the new node n' and the new M-trees.

Note that the algorithm does not employ the single path trimming used in FPMax* and
AFOPT. If, after constructing n'.FP-tree, we find that n'.FP-tree has only a
single path, the superset checking at line (22) returns true, so a superset
frequency pruning takes place instead of a single path trimming.



4 Experimental Evaluations 

The first Workshop on Frequent Itemset Mining Implementations (FIMI'03) [11],
which took place at ICDM'03 (the Third IEEE International Conference on Data
Mining), presented several recent algorithms that are good at mining
MFI, such as FPMax*, AFOPT and Mafia. We now present performance
comparisons of our FIMfi with them. All the experiments were conducted on a 2.4
GHz Pentium IV with 1024 MB of DDR memory running Microsoft Windows 2000
Professional. The code of the other four algorithms was downloaded from [12], and the
code of all five algorithms was compiled using Microsoft Visual C++ 6.0. Due to
the lack of space, only the results for three real dense datasets and one real sparse
dataset are shown here. The datasets we used were selected from the 11 real
datasets of FIMI'03 [12]: BMS-WebView-2 (sparse), Connect, Mushroom
and Pumsb_star. Their data characteristics can be found in [11].



4.1 Comparison of FP-Trees’ Number 

The item ordering policy and the PEP technique are the main improvements of FIMfi.
To test their effect on pruning, we build two sub-algorithms: FIMfi-order and
FIMfi-pep. Compared with FIMfi, FIMfi-order simply does not use PEP for pruning,
and FIMfi-pep discards our novel item ordering policy along with the array
optimization technique.

We take FPMax* as the benchmark algorithm, because it is also an MFI mining
algorithm based on the FP-tree and MFI-tree, and it performs best at MFI mining on
almost all the datasets in FIMI'03 [11].

The numbers of FP-trees created by the four algorithms are shown in Figure 5. On
the datasets Mushroom, Connect and Pumsb_star, FIMfi-order and FIMfi-pep both
generate fewer than half as many FP-trees as FPMax*. Combining
the ordering policy and PEP in FIMfi creates the fewest FP-trees of the
four algorithms. In fact, at the lowest support on Mushroom, FPMax* creates more
than 3 times as many FP-trees as FIMfi does.

Note that Figure 5 contains no result for BMS-WebView-2, because all
four algorithms generate only one tree for BMS-WebView-2, so we omit it.

[Figure 5 panels: (b) Connect, (c) Pumsb_star]

4.2 Performance Comparisons

The performance comparisons of FIMfi, FPMax*, AFOPT and Mafia on the sparse dataset
BMS-WebView-2 are shown in Figure 6. FIMfi is faster than AFOPT at the higher
supports, those no less than 50%, and FPMax* is always slower than AFOPT, at
both lower and higher supports. FIMfi outperforms FPMax* by about 20% to 40%
at all supports, and Mafia by more than 20 times.

Fig. 6. Performance on Sparse Datasets



Figure 7 gives the results of comparing the four algorithms on dense data. For all
supports on the dense datasets, FIMfi has the best performance. FIMfi runs around
40% to 60% faster than FPMax* on all of the dense datasets. AFOPT is the slowest
algorithm on Mushroom and Pumsb_star and runs from 2 to 10 times slower than FIMfi
on all of the datasets across all supports. Mafia is the slowest algorithm on Connect;
it runs between 2 and 5 times slower than FIMfi on Mushroom and Connect across all
supports. On Pumsb_star, Mafia is outperformed by FIMfi at all supports, though it
outperforms FPMax* at lower supports.

5 Conclusions

Unlike the traditional item ordering policy, in which the items are sorted in
decreasing order of support, this paper introduces a novel item ordering policy
based on the FP-tree and MFI-tree. The policy can guarantee maximal pruning at each
node in the search space tree, and thus greatly reduces the number of FP-trees
created. The experimental comparison of the number of FP-trees reveals that FIMfi
generates fewer than half as many FP-trees as the traditional policy does on dense
datasets.

We have found a simple method for fast superset checking. The method reduces
superset checking to checking the equality of two integers, and therefore lowers
the cost of superset checking.

Several old and new pruning techniques are integrated into FIMfi. Among the new
ones, superset frequency pruning based on the FP-tree is introduced here for the first
time and makes the cutting of the search space more efficient. The PEP technique used
in FIMfi greatly reduces the number of FP-trees created compared with FPMax*, as
shown by the experimental results in section 4.1.

In FIMfi we also present a new optimization of the array technique and use memory
management to further reduce the run time.

Our experimental results demonstrate that FIMfi is better optimized for mining
MFI and outperforms FPMax* by 40% on average, and that on dense data it outperforms
AFOPT and Mafia by 2 to 20 times.
Abstract. Deploying process-driven information systems is a time-consuming
and error-prone task. Process mining attempts to improve this
by automatically generating a process model from event-based data.
Existing techniques try to generate a complete process model from the data
acquired. However, unless this model is the ultimate goal of mining, such
a model is not always required. Instead, a good visualization of each
individual process instance can be enough. From these individual instances,
an overall model can then be generated if required. In this paper, we
present an approach which constructs an instance graph for each
individual process instance, based on information in the entire data set. The
results are represented in terms of Event-driven Process Chains (EPCs).
This representation is used to connect our process mining to a widely
used commercial tool for the visualization and analysis of instance EPCs.

Keywords: Process mining, Event-driven process chains, Workflow man- 
agement, Business Process Management. 



1 Introduction 

Increasingly, process-driven information systems are used to support operational 
business processes. Some of these information systems enforce a particular way 
of working. For example, Workflow Management Systems (WFMSs) can be used 
to force users to execute tasks in a predefined order. However, in many cases 
systems allow for more flexibility. For example transactional systems such as 
ERP (Enterprise Resource Planning), CRM (Customer Relationship Manage- 
ment) and SCM (Supply Chain Management) are known to allow the users to 
deviate from the process specified by the system, e.g., in the context of SAP R/3 
the reference models, expressed in terms of Event-driven Process Chains (EPCs, 
cf. [13, 14, 19]), are only used to guide users rather than to enforce a particular 
way of working. Operational flexibility typically leads to difficulties with respect 
to performance measurements. The ability to do these measurements, however, 
is what made companies decide to use a transactional system in the first place. 

To be able to calculate basic performance characteristics, most systems have 
their own built-in module. For the calculation of basic characteristics such as the 
average flow time of a case, no model of the process is required. However, for more 
complicated characteristics, such as the average time it takes to transfer work 
from one person to the other, some notion of causality between tasks is required.
This notion of causality is provided by the original model of the process, but 
deviations in execution can interfere with causalities specified there. Therefore, in 
this paper, we present a way of defining certain causal relations in a transactional 
system. We do so without using the process definition from the system, but 
only by looking at a so-called process log. Such a process log contains information
about the processes as they actually take place in a transactional system. Most
systems can provide this information in some form, and the techniques used to
infer relations between tasks in such a log are called process mining.

The problem tackled in this paper has been inspired by the software package 
ARIS PPM (Process Performance Monitor) [12] developed by IDS Scheer. ARIS 
PPM allows for the visualization, aggregation, and analysis of process instances 
expressed in terms of instance EPCs (i-EPCs). An instance EPC describes the
control-flow of a case, i.e., a single process instance. Unlike a trace (i.e., a
sequence of events) an instance EPC provides a graphical representation describing
the causal relations. In case of parallelism, there may be different traces having 
the same instance EPC. Note that in the presence of parallelism, two subsequent 
events do not have to be causally related. ARIS PPM exploits the advantages 
of having instance EPCs rather than traces to provide additional management 
information, i.e., instances can be visualized and aggregated in various ways. In 
order to do this, IDS Scheer has developed a number of adapters, e.g., there is an 
adapter to extract instance EPCs from SAP R/3. Unfortunately, these adapters 
can only create instance EPCs if the actual process is known. For example, 
the workflow management system Staffware can be used to export Staffware 
audit trails to ARIS PPM (Staffware SPM, cf. [20]) by taking projections of 
the Staffware process model. As a result, it is very time consuming to build 
adapters. Moreover, the approaches used only work in environments where there 
are explicit process models available. 

In this paper, we do not focus on the visualization, aggregation, and analysis 
of process instances expressed in terms of instance EPC or some other notation 
capturing parallelism and causality. Instead we focus on the construction of 
instance graphs. An instance graph can be seen as an abstraction of the instance 
EPCs used by ARIS PPM. In fact, we will show a mapping of instance graphs 
onto instance EPCs. Instance graphs also correspond to a specific class of Petri 
nets known as marked graphs [17], T-systems [9] or partially ordered runs [8, 10]. 
Tools like VIPTool allow for the construction of partially ordered runs given an 
ordinary Petri net and then use these instance graphs for analysis purposes. In 
our approach we do not construct instance graphs from a known Petri net but 
from an event log. This enhances the applicability of commercial tools such as 
ARIS PPM and the theoretical results presented in [8, 10]. The mapping from 
instance graphs to these Petri nets is not given here. However, it will become 
clear that such a mapping is trivial. 

In the remainder of this paper, we will first describe a common format to store 
process logs in. Then, in Section 3 we will give an algorithm to infer causality at 
an instance level, i.e. a model is built for each individual case. In Section 4 we 
will provide a translation of these models to EPCs. Section 5 shows a concrete 
example and demonstrates the link to ARIS PPM. Section 6 discusses related
work followed by some concluding remarks. 



2 Preliminaries 

This section contains most definitions used in the process of mining for instance 
graphs. The structure of this section is as follows. Subsection 2.1 defines a process 
log in a standard format. Subsection 2.2 defines the model for one instance. 



2.1 Process Logs 

Information systems typically log all kinds of events. Unfortunately, most sys- 
tems use a specific format. Therefore, we propose an XML format for storing 
event logs. The basic assumption is that the log contains information about spe- 
cific tasks executed for specific cases (i.e., process instances). Note that unlike 
ARIS PPM we do not assume any knowledge of the underlying process.
Experience with several software products (e.g., Staffware, InConcert, MQSeries
Workflow, FLOWer, etc.) and organization-specific systems (e.g., Rijkswaterstaat,
CJIB, and several hospitals) shows that these assumptions are justified.

Figure 1 shows the schema definition of the XML format. This format is supported
by our tools, and mappings from several commercial systems are available.
The format allows for logging multiple processes in one XML file (cf. element
“Process”). Within each process there may be multiple process instances
(cf. element “ProcessInstance”). Each “ProcessInstance” element is composed
of “AuditTrailEntry” elements. Instead of “AuditTrailEntry” we will also use
the terms “log entry” or “event”. An “AuditTrailEntry” element corresponds to
a single event and refers to a “WorkflowModelElement” and an “EventType”.
A “WorkflowModelElement” may refer to a single task or a subprocess. The
“EventType” is used to indicate the type of event. Typical events are: “schedule”
(i.e., a task becomes enabled for a specific instance), “assign” (i.e., a task
instance is assigned to a user), “start” (the beginning of a task instance), and
“complete” (the completion of a task instance). In total, we identify 12 events. When
building an adapter for a specific system, the system-specific events are mapped
onto these 12 generic events.

Fig. 1. XML schema for process logs.

As Figure 1 shows, the “WorkflowModelElement” and “EventType” are mandatory
for each “AuditTrailEntry”. There are three optional elements: “Data”,
“Timestamp”, and “Originator”. The “Data” element can be used to store data
related to the event of the case (e.g., the amount of money involved in the trans- 
action). The “Timestamp” element is important for calculating performance 
metrics like flow time, service times, service levels, utilization, etc. The “Origi- 
nator” refers to the actor (i.e., user or organization) performing the event. The 
latter is useful for analyzing organizational and social aspects. Although each 
element is vital for the practical applicability of process mining, we focus on 
the “WorkflowModelElement” element. In other words, we abstract from the 
“EventType”, “Data”, “Timestamp”, and “Originator” elements. However, our 
approach can easily be extended to incorporate these aspects. In fact, our tools 
deal with these additional elements. However, for the sake of readability, in this 
paper events are identified by the task and case (i.e., process instance) involved. 
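
To make the log format more concrete, the following Python sketch reads a small hand-written log fragment and extracts, for each process instance, its sequence of task references. The fragment is hypothetical: the root element name and the "id" attributes are assumptions that are not taken from the schema in Figure 1, while the other element names are the ones listed above. The function name read_instances is also only illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical log fragment. The root element name and the "id" attributes
# are assumptions; the remaining element names are those named in the text.
SAMPLE_LOG = """
<WorkflowLog>
  <Process id="example">
    <ProcessInstance id="case 1">
      <AuditTrailEntry>
        <WorkflowModelElement>task S</WorkflowModelElement>
        <EventType>complete</EventType>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>task A</WorkflowModelElement>
        <EventType>complete</EventType>
        <Originator>user 1</Originator>
      </AuditTrailEntry>
    </ProcessInstance>
  </Process>
</WorkflowLog>
"""


def read_instances(xml_text):
    """Map each process instance id to its sequence of task references."""
    root = ET.fromstring(xml_text.strip())
    instances = {}
    for instance in root.iter("ProcessInstance"):
        instances[instance.get("id")] = [
            entry.findtext("WorkflowModelElement")
            for entry in instance.iter("AuditTrailEntry")
        ]
    return instances


instances = read_instances(SAMPLE_LOG)
```

The sketch keeps only the “WorkflowModelElement” of each entry, which mirrors the abstraction from event types, data, timestamps and originators made in this paper.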

Table 1. A process log.

case identifier    task identifier
case 1             task S
case 2             task S
case 1             task A
case 1             task B
case 2             task B
case 2             task A


Table 1 shows an example of a small log after abstracting from all elements 
except for the “WorkflowModelElement” element (i.e., task identifier). The log 
shows two cases. For each case three tasks are executed. Case 1 can be described 
by the sequence SAB and case 2 can be described by the sequence SBA. In the
remainder we will describe process instances as sequences of tasks where each 
element in the sequence refers to a “WorkflowModelElement” element. A process 
log is represented as a bag (i.e., multiset) of process instances. 

Definition 2.1. (Process Instance, Process Log) Let T be a set of log 

entries, i.e., references to tasks. Let T + define the set of sequences of log entries 
with length at least 1. We call a £ T + a process instance (i.e., case) and W £ 
T + — > IN a process log. 

If $\sigma = t_1 t_2 \ldots t_n \in T^+$ is a process instance of length $n$,
then each element $t_i$ corresponds to an “AuditTrailEntry” element in Figure 1.
However, since we abstract from timestamps, event types, etc., one can think of
$t_i$ as a reference to a task. $|\sigma| = n$ denotes the length of the process
instance and $\sigma_i$ the $i$-th element. We assume process instances to be of
finite length. $W \in T^+ \rightarrow \mathbb{N}$ denotes a bag, i.e., a multiset
of process instances. $W(\sigma)$ is the number of times a process instance of
the form $\sigma$ appears in the log. The total number of instances in a bag is
finite. Since $W$ is a bag, we use the normal set operators where convenient.
For example, we use $\sigma \in W$ as a shorthand notation for $W(\sigma) > 0$.
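
To make Definition 2.1 concrete, the log of Table 1 can be held as a bag of
traces. The following is only a minimal sketch and not part of the tools
mentioned in this paper: it assumes Python's collections.Counter as the
multiset, and the name build_log is hypothetical.

```python
from collections import Counter

# Rows of Table 1 in log order: (case identifier, task identifier).
ROWS = [("case 1", "S"), ("case 2", "S"), ("case 1", "A"),
        ("case 1", "B"), ("case 2", "B"), ("case 2", "A")]

def build_log(rows):
    """Group the rows by case and return the bag W of process instances."""
    traces = {}
    for case, task in rows:
        traces.setdefault(case, []).append(task)
    # A process instance is a tuple of tasks; the Counter plays the role of W,
    # so W[instance] is the number of times that instance occurs in the log.
    return Counter(tuple(trace) for trace in traces.values())

W = build_log(ROWS)
print(W[("S", "A", "B")], W[("S", "B", "A")])
```

Because a Counter returns 0 for missing keys, the shorthand "instance in the
log" corresponds to W[instance] > 0.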

2.2 Instance Nets 

After defining a process log, we now define an instance net. An instance net is 
a model of one instance. Since we are dealing with an instance that has been 
executed in the past, it makes sense to define an instance net in such a way that 
no choices have to be made. As a consequence of this, no loops will appear in an 
instance net. For readers familiar with Petri nets it is easy to see that instance 
nets correspond to “runs” (also referred to as occurrence nets) [8].

Since events that appear multiple times in a process instance have to be 
duplicated in an instance net, we define an instance domain. The instance domain 
will be used as a basis for generating instance nets. 

Definition 2.2. (Instance domain) Let $\sigma$ be a process instance such that
$\sigma = t_1 t_2 \ldots t_n \in T^+$, i.e., $|\sigma| = n$. We define
$D_\sigma = \{1 \ldots n\}$ as the domain of $\sigma$.

Using the domain of an instance, we can link each log entry in the process
instance to a specific task, i.e., $i \in D_\sigma$ can be used to represent the
$i$-th element in $\sigma$. In an instance net, the instance $\sigma$ is extended
with some ordering relation $\succ_\sigma$ to reflect some causal relation.

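Definition 2.3. (Instance net) Let $N = (\sigma, \succ_\sigma)$ such that
$\sigma$ is a process instance. Let $D_\sigma$ be the domain of $\sigma$ and let
$\succ_\sigma$ be an ordering on $D_\sigma$ such that:

- $\succ_\sigma$ is irreflexive, asymmetric and acyclic,

- $\forall i,j \in D_\sigma\ (i < j \Rightarrow j \not\succ_\sigma i)$,

- $\forall i,j \in D_\sigma\ (i \succ_\sigma j \Rightarrow \nexists k \in D_\sigma\ (i \succ^+_\sigma k \wedge k \succ^+_\sigma j))$,
  where $\succ^+_\sigma$ is the smallest relation satisfying: $i \succ^+_\sigma j$
  if and only if $i \succ_\sigma j$ or $\exists k\ (i \succ_\sigma k \wedge k \succ^+_\sigma j)$,

- $\forall i,j \in D_\sigma\ (t_i = t_j \Rightarrow (i \succ^+_\sigma j) \vee (j \succ^+_\sigma i))$.

We call $N$ an instance net.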

The definition of an instance net given here is rather flexible, since it is defined
only as a set of entries from the log and an ordering on that set. An important
feature of this ordering is that if $i \succ_\sigma j$ then there is no set
$\{k_1, k_2, \ldots, k_n\}$ such that $i \succ_\sigma k_1$, $k_1 \succ_\sigma k_2$,
..., $k_n \succ_\sigma j$. Since the set of entries is given as a log, and an
instance mapping can be inferred for each instance based on textual properties,
we only need to define the ordering relation based on the given log. In Section 3.1
it is shown how this can be done. In Section 4 we show how to translate an
instance net to a model in a particular language (i.e., instance EPCs).

3 Mining Instance Graphs 

As seen in Definition 2.3, an instance net consists of two parts. First, it requires
a sequence of events $\sigma \in T^+$ as they appear in a specific instance. Second,
an ordering $\succ_\sigma$ on the domain of $\sigma$ is required. In this section,
we will provide a method that infers such an ordering relation on $T$ using the
whole log. Furthermore, we will present an algorithm to generate instance graphs
from these instance nets.

3.1 Creating Instance Nets 

Definition 3.1. (Causal ordering) Let $W$ be a process log over a set of log
entries $T$, i.e., $W \in T^+ \rightarrow \mathbb{N}$. Let $b \in T$ and $c \in T$
be two log entries. We define a causal ordering $\rightarrow_W$ on $W$ in the
following way:

- $b >_W c$ if and only if there is an instance $\sigma$ and
  $i \in D_\sigma \setminus \{|\sigma|\}$ such that $\sigma \in W$ and
  $\sigma_i = b$ and $\sigma_{i+1} = c$,

- $b \triangle_W c$ if and only if there is an instance $\sigma$ and
  $i \in D_\sigma \setminus \{|\sigma| - 1, |\sigma|\}$ such that $\sigma \in W$
  and $\sigma_i = \sigma_{i+2} = b$ and $\sigma_{i+1} = c$ and $b \neq c$ and
  not $b >_W b$,

- $b \rightarrow_W c$ if and only if $b >_W c$ and ($c \not>_W b$ or
  $b \triangle_W c$ or $c \triangle_W b$), or $b = c$.

The basis of the causal ordering defined here is that two tasks A and B have
a causal relation $A \rightarrow B$ if in some process instance, A is directly
followed by B and B is never directly followed by A. However, this can lead to
problems if the two tasks are in a loop of length two. Therefore,
$A \rightarrow B$ also holds if there is a process instance containing ABA or
BAB and neither A nor B can directly succeed itself. If A directly succeeds
itself, then $A \rightarrow A$. For the example log presented in Table 1,
$T = \{S, A, B\}$ and the causal ordering inferred on $T$ is composed of the
following two elements: $S \rightarrow_W A$ and $S \rightarrow_W B$.
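
The rules of Definition 3.1 can be sketched in a few lines. This is an
illustration only, not the implementation used in our tools: the function name
causal_ordering is hypothetical, traces are assumed to be plain strings with one
character per task, and the reflexive clause "or b = c" is deliberately left
out so that the result matches the two elements listed above.

```python
def causal_ordering(traces):
    """Return the set of pairs (b, c) with b -> c (Definition 3.1, without
    the reflexive clause 'or b = c', which is an assumption of this sketch)."""
    follows = set()       # b > c: b is directly followed by c in some trace
    for t in traces:
        follows.update(zip(t, t[1:]))
    triangle = set()      # pattern b c b with b != c and not b > b
    for t in traces:
        for b, c, b_again in zip(t, t[1:], t[2:]):
            if b == b_again and b != c and (b, b) not in follows:
                triangle.add((b, c))
    return {(b, c) for (b, c) in follows
            if (c, b) not in follows
            or (b, c) in triangle
            or (c, b) in triangle}

table1 = ["SAB", "SBA"]
print(sorted(causal_ordering(table1)))  # [('S', 'A'), ('S', 'B')]
```

A and B follow each other in both directions and never form a loop of length
two, so no causal relation is derived between them.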

By defining the $\rightarrow_W$ relation, we defined an ordering relation on $T$.
This relation is not necessarily irreflexive, asymmetric, nor acyclic. The
$\rightarrow_W$ relation can, however, be used to induce an ordering on the
domain of any instance $\sigma$ that has these properties. This is done in two
steps. First, an asymmetric order is defined on the domain of some $\sigma$.
Then, we prove that this relation is irreflexive and acyclic.

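Definition 3.2. (Instance ordering) Let $W$ be a process log over $T$ and let
$\sigma \in W$ be a process instance. Furthermore, let $\rightarrow_W$ be a
causal ordering on $T$. We define an ordering $\succ_\sigma$ on the domain of
$\sigma$, $D_\sigma$, in the following way. For all $i, j \in D_\sigma$ such
that $i < j$ we define $i \succ_\sigma j$ if and only if
$\sigma_i \rightarrow_W \sigma_j$ and
($\nexists_{i<k<j}(\sigma_i \rightarrow_W \sigma_k)$ or
$\nexists_{i<k<j}(\sigma_k \rightarrow_W \sigma_j)$).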

The essence of the relation defined here is in the final part. For each entry 
within an instance, we find the closest causal predecessor and the closest causal 
successor. If there is no causal predecessor or successor then the entry is in 
parallel with all its predecessors or successors respectively. It is trivial to see that 
this can always be done for any process instance and with any causal relation. 

In the example log presented in Table 1 there are two process instances, case
1 and case 2. From here on, we will refer to case 1 as $\sigma_1$ and to case 2
as $\sigma_2$. We know that $\sigma_1 = SAB$ and that $D_{\sigma_1} = \{1, 2, 3\}$.
Using the causal relation $\rightarrow_W$ the relation $\succ_{\sigma_1}$ is
inferred such that $1 \succ_{\sigma_1} 2$ and $1 \succ_{\sigma_1} 3$. For
$\sigma_2$ this also applies.
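
The instance ordering of Definition 3.2 can be computed directly from a trace
and a causal relation. The sketch below is illustrative only: the name
instance_ordering is hypothetical, positions are 1-based as in the instance
domain, and the causal relation is passed in as a set of task pairs.

```python
def instance_ordering(trace, causal):
    """Pairs (i, j) of 1-based positions that are ordered by Definition 3.2."""
    n = len(trace)
    task = lambda i: trace[i - 1]
    order = set()
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if (task(i), task(j)) not in causal:
                continue
            between = range(i + 1, j)
            # j is the closest causal successor of i ...
            no_closer_successor = not any(
                (task(i), task(k)) in causal for k in between)
            # ... or i is the closest causal predecessor of j.
            no_closer_predecessor = not any(
                (task(k), task(j)) in causal for k in between)
            if no_closer_successor or no_closer_predecessor:
                order.add((i, j))
    return order

causal = {("S", "A"), ("S", "B")}   # causal ordering of Table 1, as stated above
print(sorted(instance_ordering("SAB", causal)))  # [(1, 2), (1, 3)]
print(sorted(instance_ordering("SBA", causal)))  # [(1, 2), (1, 3)]
```

Both cases yield the same pairs of positions, although position 2 refers to
task A in case 1 and to task B in case 2.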

It is easily seen that the ordering relation $\succ_\sigma$ is indeed irreflexive
and asymmetric, since it is only defined on $i$ and $j$ for which $i < j$.
Therefore, it can easily be concluded that it is irreflexive and acyclic.
Furthermore, the third property holds as well. Therefore we can now define an
instance net as $(\sigma, \succ_\sigma)$.






B.F. van Dongen and W.M.P. van der Aalst 



3.2 Creating Instance Graphs 

In this section, we present an algorithm to generate an instance graph from an 
instance net. An instance graph is a graph where each node represents one log 
entry of a specific instance. These instance graphs can be used as a basis to 
generate models in a particular language. 

Definition 3.3. (Instance graph) Consider a set of nodes $N$ and a set of
edges $E \subseteq N \times N$. We call $G = (N, E)_\sigma$ an instance graph of
an instance net $(\sigma, \succ_\sigma)$ if and only if the following conditions
hold.

1. $N = D_\sigma \cup \{0, |D_\sigma| + 1\}$ is the set of nodes.

2. The set of edges $E$ is defined as $E = E_{rel} \cup E_{initial} \cup E_{final}$, where
   $E_{rel} = \{(n_1, n_2) \in N \times N \mid n_1 \succ_\sigma n_2\}$ and
   $E_{initial} = \{(0, n) \in N \times N \mid \nexists_{n_1}(n_1 \succ_\sigma n)\}$ and
   $E_{final} = \{(n, |N| - 1) \in N \times N \mid \nexists_{n_1}(n \succ_\sigma n_1)\}$.

An instance graph as described in Definition 3.3 is a graph that typically 
describes an execution path of some process model. This property is what makes 
an instance graph a good description of an instance. It not only shows causal 
relations between tasks but also parallelism if parallel branches are taken by 
the instance. However, choices are not represented in an instance graph. The 
reason for that is obvious, since choices are made at the execution level and do 
not appear in an instance. With respect to these choices, we can also say that 
if the same choices are made at execution, the resulting instance graph is the 
same. Note, that the fact that the same choices are made does not imply that 
the process instance is the same. Tasks that can be done in parallel within one 
instance can appear in any order in an instance without changing the resulting 
instance graph. 

For case 1 of the example log of Table 1 the instance graph is drawn in
Figure 2. Note that in this graph, the nodes 1, 2 and 3 are actually in the domain
of $\sigma_1$ and therefore, they refer to entries in Table 1. It is easily seen
that for case 2 this graph looks exactly the same, although the nodes refer to
different entries.
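
The construction of Definition 3.3 can be sketched as follows. The function
name instance_graph is hypothetical, and the source and sink edges are only
generated for nodes of the instance domain, which is how the definition is
used in the proof of Property 3.6; that restriction is an assumption of this
sketch.

```python
def instance_graph(n, order):
    """Nodes and edges of the instance graph for a trace of length n, given
    the instance ordering as a set of pairs of 1-based positions."""
    nodes = set(range(0, n + 2))            # domain plus source 0 and sink n + 1
    has_predecessor = {j for (_, j) in order}
    has_successor = {i for (i, _) in order}
    e_rel = set(order)
    e_initial = {(0, i) for i in range(1, n + 1) if i not in has_predecessor}
    e_final = {(i, n + 1) for i in range(1, n + 1) if i not in has_successor}
    return nodes, e_rel | e_initial | e_final

# Case 1 of Table 1 (trace SAB) with the ordering 1 > 2 and 1 > 3.
nodes, edges = instance_graph(3, {(1, 2), (1, 3)})
print(sorted(edges))  # [(0, 1), (1, 2), (1, 3), (2, 4), (3, 4)]
```

Nodes 2 and 3 have no ordering between them, which is how the graph expresses
that the corresponding tasks were executed in parallel.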

In order to make use of instance graphs, we will show that an instance graph
indeed describes an instance such that an entry in the log can only appear if all
predecessors of that entry in the graph have already appeared in the instance.

Fig. 2. Instance graph for $\sigma_1$.

Definition 3.4. (Pre- and postset) Let $G = (N, E)_\sigma$ be an instance graph
and let $n \in N$. We define $\bullet_G n$ to be the preset of $n$ such that
$\bullet_G n = \{n' \in N \mid (n', n) \in E\}$. We define $n \bullet_G$ to be
the postset of $n$ such that $n \bullet_G = \{n' \in N \mid (n, n') \in E\}$.

Property 3.5. (Instance graphs describe an instance) Every instance
graph $G = (N, E)_\sigma$ of some process instance $\sigma$ describes that
instance in such a way that for all $i, j \in N$ it holds that
$j \in \bullet_G i$ implies that $j < i$. This ensures that every entry in
process instance $\sigma$ occurs only after all its predecessors in the instance
graph have occurred in $\sigma$.

Proof. To prove that this is indeed the case for instance graph $G = (N, E)_\sigma$,
we consider Definition 3.3 which implies that for “internal nodes” we know that
$(n_1, n_2) \in E$ if and only if $n_1 \succ_\sigma n_2$. Furthermore, from the
definition of $\succ_\sigma$ we know that $n_1 \succ_\sigma n_2$ implies that
$n_1 < n_2$. For the source and sink nodes, it is also easy to show that
$n_1 \in \bullet_G n_2$ implies that $n_1 < n_2$ because 0 is the smallest
element of $N$ while $|N| - 1$ is the largest. □

Property 3.6. (Strongly connectedness) For every instance graph
$G = (N, E)_\sigma$ of some process instance $\sigma$ it holds that the short
circuited graph $G' = (N, E \cup \{(|N| - 1, 0)\})$ is strongly connected
(footnote 1).

Proof. From Definition 3.3 we know that for all $i \in D_\sigma$ such that there
does not exist a $j \in D_\sigma$ such that $j \succ_\sigma i$ it holds that
$(0, i) \in E$. Furthermore, we know that for all $i \in D_\sigma$ such that
there does not exist a $j \in D_\sigma$ such that $i \succ_\sigma j$ it holds
that $(i, |\sigma| + 1) \in E$. Therefore, the graph is strongly connected if
the edge $(|N| - 1, 0)$ is added to $E$. □

In the remainder of this paper, we will focus on an application of instance
graphs. In Section 4 a translation from these instance graphs to a specific model
is given.



4 Instance EPCs 

In Section 3 instance graphs were introduced. In this section, we will present an 
algorithm to generate instance EPCs from these graphs. An instance EPC is a 
special case of an EPC (Event-driven Process Chain, [13]). For more information 
on EPCs we refer to [13,14,19]. These instance EPCs (or i-EPCs) can only 
contain AND-split and AND-join connectors, and therefore do not allow for 
loops to be present. These i-EPCs serve as a basis for the tool ARIS PPM 
(Process Performance Monitor) described in the introduction. 

In this section, we first provide a formal definition of an instance EPC. An
instance EPC does not contain any connectors other than AND-split and AND-join
connectors. Furthermore, there is exactly one initial event and one final
event. Functions refer to the entries that appear in a process log; events,
however, do not appear in the log. Therefore, we make the assumption here that
each event uniquely causes a function to happen and that functions result in one
or more events. An exception to this assumption is made when there are multiple
functions that are the start of the instance. These functions are all preceded
by an AND-split connector. This connector is preceded by the initial event.
Consequently, all other connectors are preceded by functions and succeeded by
events.

Footnote 1: A graph is strongly connected if there is a directed path from any
node to any other node in the graph.

Definition 4.1. (Instance EPC) Consider a set of events $E$, a set of functions
$F$, a set of connectors $C$ and a set of arcs
$A \subseteq ((E \cup F \cup C) \times (E \cup F \cup C)) \setminus ((E \times E) \cup (F \times F))$.
We call $(E, F, C, A)$ an instance EPC if and only if the following conditions
hold.

1. $E \cap F = F \cap C = E \cap C = \emptyset$.

2. Functions and events alternate in the presence of connectors:
   $\forall_{n_1, n_2 \in E \cup F}\ \forall_{(c_1, c_2) \in (A \cap (C \times C))^+ \cup I}\ ((n_1, c_1) \in A \wedge (c_2, n_2) \in A) \Rightarrow (n_1 \in E \Leftrightarrow n_2 \in F)$,
   where $I = \{(c, c) \mid c \in C\}$.

3. The graph $(E \cup F \cup C, A)$ is acyclic.

4. There exists exactly one event $e_i \in E$ such that there is no element
   $n \in F \cup C$ such that $(n, e_i) \in A$. We call $e_i$ the initial event.

5. There exists exactly one event $e_f \in E$ such that there is no element
   $n \in F \cup C$ such that $(e_f, n) \in A$. We call $e_f$ the final event.

6. The graph $(E \cup F \cup C, A \cup \{(e_f, e_i)\})$ is strongly connected.

7. For each function $f \in F$ there are exactly two elements
   $n_1, n_2 \in E \cup C$ such that $(f, n_1) \in A$ and $(n_2, f) \in A$.
   Functions only have one input and one output.

8. For each event $e \in E \setminus \{e_i, e_f\}$ there are exactly two elements
   $n_1, n_2 \in F \cup C$ such that $(e, n_1) \in A$ and $(n_2, e) \in A$.
   Events only have one input and one output, except for the initial and the
   final event. For them the following holds. For $e_i$ there is exactly one
   element $n \in F \cup C$ such that $(e_i, n) \in A$ and for $e_f$ there is
   exactly one element $n \in F \cup C$ such that $(n, e_f) \in A$.



4.1 Generating Instance EPCs 

Using the formal definition of an instance EPC from Definition 4.1, we introduce 
an algorithm that produces an instance EPC from an instance graph as defined 
in Definition 3.3. In the instance EPC generated it makes sense to label the 
functions according to the combination of the task name and event type as they 
appear in the log. The labels of the events however cannot be determined from 
the log. Therefore, we propose to label the events in the following way. The 
initial event will be labeled “initial”. The final event will be labeled “final”. All
other events will be labeled in such a way that it is clear which function succeeds 
it. Connectors are labeled in such a way that it is clear whether it is a split or 
a join connector and to which function or event it connects with the input or 
output respectively. 

Definition 4.2. (Converting instance graphs to EPCs) Let $W$ be a process
log and let $G = (N_G, E_G)_\sigma$ be an instance graph for some process
instance $\sigma \in W$. To create an instance EPC, we need to define the four
sets $E$, $F$, $C$ and $A$.

- The set of functions $F$ is defined as $F = \{f_i \mid i \in D_\sigma\}$. In
  other words, for every entry in the process instance, a function is defined.

- The set of events $E$ is defined as
  $E = \{e_{f_i} \mid f_i \in F \text{ and } \exists_{j \in D_\sigma}(j \succ_\sigma i)\} \cup \{e_{initial}, e_{final}\}$.
  In other words, for every function there is an event preceding it, unless it
  is a minimal element with respect to $\succ_\sigma$. Furthermore, there is
  an initial event $e_{initial}$ and a final event $e_{final}$.

- The set of connectors $C$ is defined as
  $C = C_{split} \cup C_{join} \cup C_i \cup C_f$ where
  $C_{split} = \{c_{(split, f_i)} \mid f_i \in F \wedge |i \bullet_G| > 1\}$ and
  $C_{join} = \{c_{(join, e_{f_i})} \mid e_{f_i} \in E \wedge |\bullet_G i| > 1\}$ and
  $C_i = \{c_{(split, e_{initial})} \mid |0 \bullet_G| > 1\}$ and
  $C_f = \{c_{(join, e_{final})} \mid |\bullet_G (|N_G| - 1)| > 1\}$.
  Here, the connectors are constructed in such a way that connectors are
  always preceded by a function, except in case the process starts with parallel
  functions, since then the event $e_{initial}$ is succeeded by a split connector.

- The set of arcs $A$ is defined as
  $A = A_{ef} \cup A_{fe} \cup A_{split} \cup A_{join} \cup A_i \cup A_f$ where
  $A_{ef} = \{(e_{f_i}, f_i) \in (E \times F)\}$ and
  $A_{fe} = \{(f_i, e_{f_j}) \in (F \times E) \mid (i, j) \in E_G \wedge |i \bullet_G| = 1 \wedge |\bullet_G j| = 1\}$ and
  $A_{split} = \{(f_i, c_{(split, f_i)}) \in (F \times C_{split})\} \cup
  \{(c_{(split, f_i)}, e_{f_j}) \in (C_{split} \times E) \mid (i, j) \in E_G \wedge |\bullet_G j| = 1\} \cup
  \{(c_{(split, f_i)}, c_{(join, e_{f_j})}) \in (C_{split} \times C_{join}) \mid (i, j) \in E_G\}$ and
  $A_{join} = \{(c_{(join, e_{f_i})}, e_{f_i}) \in (C_{join} \times E)\} \cup
  \{(f_i, c_{(join, e_{f_j})}) \in (F \times C_{join}) \mid (i, j) \in E_G \wedge |i \bullet_G| = 1\}$ and
  $A_i = \{(e_{initial}, c_{(split, e_{initial})}) \in (E \times C_i)\} \cup
  \{(c_{(split, e_{initial})}, f_i) \in (C_i \times F) \mid (0, i) \in E_G\}$ and
  $A_f = \{(c_{(join, e_{final})}, e_{final}) \in (C_f \times E)\} \cup
  \{(f_i, c_{(join, e_{final})}) \in (F \times C_f) \mid (i, |N_G| - 1) \in E_G\}$.

It is easily seen that the instance EPC generated by Definition 4.2 is indeed 
an instance EPC, by verifying the result against Definition 4.1. 

In Definitions 3.3 and 4.2 we have given an algorithm to generate an instance
EPC for each instance graph. The result of this algorithm for both cases in the
example of Table 1 can be found in Figure 3. In Section 5 we will show the
practical use of this algorithm for ARIS PPM.
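
As a partial illustration of the conversion, the node sets F, E and C can be
derived from the in- and out-degrees of the instance graph. The sketch below
covers only these three sets and leaves the arcs out; the function name
epc_node_sets and the string labels are hypothetical and chosen for
readability, not taken from ARIS PPM or from our tools.

```python
def epc_node_sets(n, edges):
    """Functions, events and connectors for an instance graph with internal
    nodes 1..n, source 0 and sink n + 1 (the arc set is not constructed)."""
    out_degree = {v: 0 for v in range(n + 2)}
    in_degree = {v: 0 for v in range(n + 2)}
    for a, b in edges:
        out_degree[a] += 1
        in_degree[b] += 1
    internal = range(1, n + 1)
    # Nodes that have a predecessor inside the instance domain.
    has_predecessor = {b for (a, b) in edges if a != 0}
    functions = {f"f{i}" for i in internal}
    events = {f"e_f{i}" for i in internal if i in has_predecessor}
    events |= {"e_initial", "e_final"}
    connectors = {f"split_f{i}" for i in internal if out_degree[i] > 1}
    connectors |= {f"join_e_f{i}" for i in internal if in_degree[i] > 1}
    if out_degree[0] > 1:
        connectors.add("split_e_initial")
    if in_degree[n + 1] > 1:
        connectors.add("join_e_final")
    return functions, events, connectors

# Instance graph of case 1 in Table 1 (trace SAB).
edges = {(0, 1), (1, 2), (1, 3), (2, 4), (3, 4)}
functions, events, connectors = epc_node_sets(3, edges)
print(sorted(functions), sorted(events), sorted(connectors))
```

For this graph the function for S gets an AND-split, and the two parallel
branches are joined again in front of the final event.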




Fig. 3. Instance EPC for $\sigma_1$.

5 Example

In this section, we present an example illustrating the algorithms described in 
sections 3 and 4. We will start from a process log with some process instances. 
Then, we will run the algorithms to generate a set of instance EPCs that can be 
imported into ARIS PPM. 

5.1 A Process Log 

Consider a process log consisting of the following traces. 

Table 2. A process log.

case identifier    task executions
case 1             S1, A2, B3, F4, C5, D6, H7, G8, T9
case 2             S1, A2, C3, B4, E5, H6, F7, G8, T9
case 3             S1, A2, D3, B4, C5, F6, H7, G8, T9
case 4             S1, A2, E3, B4, C5, H6, F7, G8, T9
case 5             S1, A2, B3, D4, F5, H6, C7, G8, T9
case 6             S1, A2, B3, E4, F5, H6, C7, G8, T9
case 7             S1, A2, B3, F4, D5, C6, H7, G8, T9
case 8             S1, A2, B3, F4, E5, C6, H7, G8, T9
case 9             S1, A2, D3, C4, B5, H6, F7, G8, T9
case 10            S1, A2, C3, E4, H5, B6, F7, G8, T9

The process log in Table 2 shows the execution of tasks for a number of 
different instances of the same process. To save space, we abstracted from the 
original names of tasks and named each task with a single letter. The subscript 
refers to the position of that task in the process instance. 

Using this process log, we will first generate the causal relations from
Definition 3.1. Note that causal relations are to be defined between tasks and
not between log entries. Therefore, the subscripts are omitted here. This
definition leads to the following set of causal relations:
$\{S \rightarrow A, A \rightarrow B, A \rightarrow C, A \rightarrow D,
A \rightarrow E, B \rightarrow F, D \rightarrow H, E \rightarrow H,
F \rightarrow G, C \rightarrow G, H \rightarrow G, G \rightarrow T\}$.
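
The derivation of these relations from the ten traces of Table 2 can be
replayed with a short script. This is a sketch under assumptions: traces are
written as strings without the position subscripts, the name mine_causal is
hypothetical, and the reflexive clause of Definition 3.1 is omitted because the
set above lists no reflexive pairs.

```python
TRACES = ["SABFCDHGT", "SACBEHFGT", "SADBCFHGT", "SAEBCHFGT", "SABDFHCGT",
          "SABEFHCGT", "SABFDCHGT", "SABFECHGT", "SADCBHFGT", "SACEHBFGT"]

def mine_causal(traces):
    """Causal relation of Definition 3.1 (without the clause 'or b = c')."""
    # b > c: b is directly followed by c in at least one trace.
    follows = {pair for t in traces for pair in zip(t, t[1:])}
    # Loops of length two (pattern b c b); none occur in this particular log.
    loops = set()
    for t in traces:
        for i in range(len(t) - 2):
            b, c = t[i], t[i + 1]
            if t[i + 2] == b and b != c and (b, b) not in follows:
                loops.add((b, c))
    causal = {(b, c) for (b, c) in follows
              if (c, b) not in follows or (b, c) in loops or (c, b) in loops}
    return causal, loops

causal, loops = mine_causal(TRACES)
print(sorted(b + "->" + c for b, c in causal))
```

Pairs such as B and C are followed by each other in both directions across the
log, so they are treated as parallel and do not appear in the result.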

Using these relations, we generate instance graphs as described in Section 3 
for each process instance. Then, these instance graphs are imported into ARIS 
PPM and a screenshot of this tool is presented (cf. Figure 5). 

5.2 Instance Graphs 

To illustrate the concept of instance graphs, we will present the instance graph
for the first instance, “case 1”. In order to do this, we will follow Definition 3.2
to generate an instance ordering for that instance. Then, using these orderings,
an instance graph is generated. Applying Definition 3.2 to case 1 in the log
presented in Table 2 using the causal relations given in Section 5.1 gives the
following instance ordering: $0 \succ 1$, $1 \succ 2$, $2 \succ 3$, $3 \succ 4$,
$4 \succ 8$, $2 \succ 5$, $5 \succ 8$, $2 \succ 6$, $6 \succ 7$, $7 \succ 8$,
$8 \succ 9$, $9 \succ 10$.

Using this instance ordering, an instance graph can be made as described 
in Definition 3.3. The resulting graph can be found in Figure 4. Note that the 
instance graphs of all other instances are isomorphic to this graph. Only, the 
numbers of the nodes change. 
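
The ordering for case 1 can be checked mechanically. The sketch below combines
Definitions 3.2 and 3.3 in one hypothetical function, instance_graph_edges, and
takes the causal relations of Section 5.1 as given input; it is an illustration
rather than the implementation used in our tools.

```python
CAUSAL = {("S", "A"), ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
          ("B", "F"), ("D", "H"), ("E", "H"), ("F", "G"), ("C", "G"),
          ("H", "G"), ("G", "T")}

def instance_graph_edges(trace, causal):
    """Edges of the instance graph of a trace; node 0 is the source and
    node len(trace) + 1 is the sink."""
    n = len(trace)
    rel = set()
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            first, last = trace[i - 1], trace[j - 1]
            if (first, last) not in causal:
                continue
            between = trace[i:j - 1]        # tasks at positions i+1 .. j-1
            no_closer_successor = all((first, m) not in causal for m in between)
            no_closer_predecessor = all((m, last) not in causal for m in between)
            if no_closer_successor or no_closer_predecessor:
                rel.add((i, j))
    with_predecessor = {j for _, j in rel}
    with_successor = {i for i, _ in rel}
    source = {(0, i) for i in range(1, n + 1) if i not in with_predecessor}
    sink = {(i, n + 1) for i in range(1, n + 1) if i not in with_successor}
    return rel | source | sink

edges = instance_graph_edges("SABFCDHGT", CAUSAL)   # case 1 of Table 2
print(sorted(edges))
```

The twelve edges returned are the pairs listed in the instance ordering above,
with nodes 0 and 10 acting as source and sink.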



Fig. 4. Instance graph for case 1.



For each process instance, such an instance graph can be made. Using the 
algorithm presented in Section 4 each instance can then be converted into an
instance EPC. These instance EPCs can be imported directly into ARIS PPM for 
further analysis. Here, we would like to point out again that our tools currently 
provide an implementation of the algorithms in this paper, such that the instance 
EPCs generated can be imported into ARIS PPM directly. A screenshot of this 
tool can be found in Figure 5 where “case 1” is shown as an instance EPC. 
Furthermore, inside the boxed area, the aggregation of some cases is shown. 
Note that this aggregation is only part of the functionality of ARIS PPM. Using 
graphical representations of instances, a large number of analysis techniques is 
available to the user. However, creating instances without knowing the original 
process model is an important first step. 



6 Related Work 

The idea of process mining is not new [1,3,5-7,11,12,15,16,18,21] and most 
techniques aim at the control-flow perspective. For example, the α-algorithm
allows for the construction of a Petri net from an event log [1,5]. However, process 
mining is not limited to the control-flow perspective. For example, in [2] we use 
process mining techniques to construct a social network. For more information 
on process mining we refer to a special issue of Computers in Industry on process 
mining [4] and a survey paper [3]. In this paper, unfortunately, it is impossible 
to do justice to the work done in this area. To support our mining efforts we 
have developed a set of tools including EMiT [1], Thumb [21], and MinSoN [2]. 
These tools share the XML format discussed in this paper. For more details we 
refer to www.processmining.org. 

The focus of this paper is on the mining of the control-flow perspective.
However, instead of constructing a process model, we mine for instance graphs.
The result can be represented in terms of a Petri net or an (instance) EPC.
Therefore, our work is related to tools like ARIS PPM [12], Staffware SPM [20],
and VIPTool [10]. Moreover, the mining result can be used as a basis for applying
the theoretical results regarding partially ordered runs [8].

Fig. 5. ARIS PPM screenshot.

7 Conclusion 

The focus of this paper has been on mining for instance graphs. Algorithms are 
presented to describe each process instance in a particular modelling language. 
From the instance graphs described in Section 3, other models can be created as 
well. The main advantage of looking at instances in isolation is twofold. First, it 
can provide a good starting point for all kinds of analysis such as the ones imple- 
mented in ARIS PPM. Second, it does not require any notion of completeness 
of a process log to work. As long as a causal relation is provided between log
entries, instance graphs can be made. Existing methods such as the α-algorithm
[1,3,5] usually require some notion of completeness in order to rediscover the
entire process model. The downside thereof is that it is often hard to deal with
noisy process logs. In our approach noise can be filtered out before inferring the
causal dependencies between log entries, without negative implications on the
result of the mining process.

ARIS PPM allows for the aggregation of instance EPCs into an aggregated
EPC. This approach illustrates the wide applicability of instance graphs.
However, the aggregation is based on simple heuristics that fail in the presence
of complex routing structures. Therefore, we are developing algorithms for the
integration of multiple instance graphs into one EPC or Petri net. Early
experiments suggest that such a two-step approach alleviates some of the problems
existing process mining algorithms are facing [3,4].
Abstract. XML is becoming increasingly important in data exchange and
information management. The starting point for retrieving information and
integrating documents efficiently is to cluster the documents that have a similar
structure. Thus, in this paper, we propose a new XML document clustering method
based on structural similarity. Our approach first extracts the representative
structures of XML documents by sequential pattern mining. We then cluster
XML documents of similar structure using a clustering algorithm for
transactional data, treating each XML document as a transaction and the frequent
structures of the document as the items of the transaction. We also apply our
technique to XML retrieval. Our experiments show the efficiency and good
performance of the proposed clustering method.

Keywords: Document Clustering, XML Document, Sequential Pattern, Structural
Similarity, Structural Retrieval



1 Introduction 

XML (eXtensible Markup Language) is a standard for data representation and
exchange on the Web, and we will find large XML document collections on the Web
in the near future. Therefore, it has become crucial to address the question of
how we can efficiently query and search XML documents. Meanwhile, the
hierarchical structure of XML has a great influence on information retrieval,
document management systems, and data mining [1,2,3,4].

Since an XML document is represented as a tree structure, one can explore the
relationship among XML documents using various tree matching algorithms [5,6].
A closely related problem is to find trees in a database that “match” a given
pattern or query tree [7]. This type of retrieval often exploits various filters
that eliminate unqualified data trees from consideration at an early stage of
retrieval. The filters accelerate the retrieval process. Another approach to
facilitating a search is to cluster XML documents into appropriate categories.

In this paper, we propose a new XML clustering technique based on similar
structure. We first extract the representative structures of frequent patterns,
including hierarchical structure information, from XML documents by the
sequential pattern mining method [8]. We then perform the document clustering by
considering both the CLOPE algorithm [9] and large items [10], treating each XML
document as a transaction and the frequent structures extracted from the
documents as the items of the transaction. We also apply our method to structural
retrieval of XML documents in order to verify the efficiency of the proposed
technique.

The remainder of the paper is organized as follows. Section 2 reviews previous
research related to the structure of XML documents. Section 3 describes the
method for extracting the representative structures of XML documents. In
Section 4, we define our clustering criterion using large items and describe how
clusters are updated. Section 5 explains how to apply our clustering method to
XML retrieval. Section 6 shows the experimental results of the clustering
algorithm and of XML retrieval, and Section 7 concludes the paper.

2 Related Works 

As the number of XML documents with various structures increases, methods are
needed that classify documents of similar structure and retrieve them [3,4].

[11] considered XML as a tree and analyzed the similarity among documents
by taking semantics into account. [12] pointed out the need to manage the growing
number of XML documents and proposed a clustering method over the element tags
and the text of XML documents using the k-means algorithm.

According to [3,4,13], there are two kinds of structure mining techniques for
extracting the XML document structure: intra-structured mining for one document
and inter-structured mining for various documents. However, no concrete
algorithm is described.

[14] proposed a clustering method for DTDs based on the similarity of elements
as a way to find a mediating DTD for integrating DTDs. But it can only be
applied to DTDs from the same application domain. [15] concentrated on finding
the common structure of trees, without considering document clustering. [16]
grouped trees by pairs of labels that occur together frequently, and then found
a subset of the frequent trees. But multi-relational tree structures cannot be
detected, because the method is based on label pairs. [17] proposed a method for
clustering XML documents using bitmap indexing, but it requires too much space
for a large number of documents.

In this paper, we use the CLOPE algorithm [9], adding the notion of large items
for document clustering. The CLOPE algorithm uses only the rate of the common
items, without considering the individual items in a cluster. As a result, the
similarity between clusters may become high, and the number of clusters may not
be controllable. In order to address this problem, we add the notion of large
items in a cluster to the CLOPE algorithm.

3 Extracting the Representative Structure of XML Documents 

An XML document has a sequential and hierarchical structure of elements.
Therefore, both the order of the elements and the elements themselves are
features that can distinguish XML documents [11,13]. Thus, we use sequential
pattern mining, which considers both the frequency and the order of elements.

3.1 Element Path Sequences 

We first extract the representative structures of each document based on the
paths from the root to those elements that have a content value. Figure 1 is an
example XML document used to show how to find the representative structures of
a document.

<bookinventory>
  <book>
    <title>Great Expectations</title>
    <author>
      <name>Charles Dickens</name>
      <born>1812</born>
      <died>1879</died>
      <nationality>English</nationality>
    </author>
    <count>10</count>
  </book>
</bookinventory>

Fig. 1. An XML document

element          rename    element        rename
bookinventory    a         name           d1
book             b         born           d2
title            c1        died           d3
author           c2        nationality    d4
count            c3

Fig. 2. Element mapping table



We rename each element with a letter so that elements are easy to distinguish,
using the element mapping table shown in Figure 2. With the elements renamed
according to Figure 2, the element paths that have content values are represented
as in Figure 3, in which element paths are regarded as sequences and each element
contained in a sequence is regarded as an item. We then find the frequent
sequence structures that satisfy the given minimum support using the sequential
pattern mining algorithm.

3.2 The Sequential Pattern Algorithm to Extract the Frequent Structure

To extract the frequent structures, we apply the PrefixSpan algorithm [8] to the
sequences of Figure 3. To do this, we define the frequent structure minimum
support as follows.

Definition 1 (Frequent Structure Minimum Support). The frequent structure minimum
support is the least frequency that satisfies the rate of the frequent structure
among the whole paths in a document, and the path sequences that satisfy this
condition are the frequent structures. The formula is as follows.

FFMS = frequent structure rate * the number of paths of the whole document

( 0 < frequent structure rate < 1 )

If the frequent structure rate is 0.2, the FFMS of the sequence set of Figure 3
is 2 (0.2 * 6 = 1.2, rounded up). The length-1 elements whose frequency satisfies
the FFMS are a: 6, b: 6, c2: 4. Starting from these length-1 sequential patterns,
we extract the frequent pattern structures using the projected DB (refer to [8]
for the detailed algorithm). According to this method, the maximal frequent
structure in Figure 3 is <a/b/c2>, and this path occurs at a rate of about
66% (4/6) in the whole document.
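
The numbers in this example can be reproduced with a small script. It is only a
sketch: it rounds the minimum support up, which is an assumption read off the
example (0.2 * 6 = 1.2 gives 2), and instead of running PrefixSpan it simply
counts root-anchored path prefixes, which is sufficient for the six paths of
Figure 3. The function names are hypothetical.

```python
import math
from collections import Counter

PATHS = ["a/b/c1", "a/b/c2/d1", "a/b/c2/d2", "a/b/c2/d3", "a/b/c2/d4", "a/b/c3"]

def frequent_elements(paths, rate):
    """Minimum support and the length-1 elements that reach it."""
    # round() guards against floating-point noise before rounding up.
    min_support = math.ceil(round(rate * len(paths), 9))
    counts = Counter(e for p in paths for e in set(p.split("/")))
    frequent = {e: c for e, c in counts.items() if c >= min_support}
    return min_support, frequent

def maximal_frequent_prefix(paths, min_support):
    """Longest root-anchored prefix whose count reaches the minimum support."""
    prefix_counts = Counter()
    for p in paths:
        parts = p.split("/")
        for k in range(1, len(parts) + 1):
            prefix_counts["/".join(parts[:k])] += 1
    frequent = [p for p, c in prefix_counts.items() if c >= min_support]
    best = max(frequent, key=lambda p: p.count("/"))
    return best, prefix_counts[best]

min_support, frequent = frequent_elements(PATHS, 0.2)
print(min_support, sorted(frequent.items()))
print(maximal_frequent_prefix(PATHS, min_support))
```

With a rate of 0.2 only a, b and c2 survive, and the longest frequent path is
a/b/c2, which occurs in 4 of the 6 paths.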

We also include, among the representative structures used as input data for
clustering, the structures whose length exceeds a fixed fraction of the length of
the maximal frequent structure (e.g., maximal frequent length 5 * 80% = frequent
structure length 4). This avoids missing frequent structures when a document
covers various subjects.



X_id   X_path
1      a/b/c1
2      a/b/c2/d1
3      a/b/c2/d2
4      a/b/c2/d3
5      a/b/c2/d4
6      a/b/c3

Fig. 3. Element path sequences
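The numbers quoted for Figure 3 can be checked with a short sketch. It is not PrefixSpan itself: since every sequence here is a root-to-leaf path, we assume that counting shared path prefixes is sufficient for this example. The function names and the rounding-up of FFMS are assumptions made for illustration.

```python
import math
from collections import Counter

# Element path sequences of Figure 3.
paths = ["a/b/c1", "a/b/c2/d1", "a/b/c2/d2", "a/b/c2/d3", "a/b/c2/d4", "a/b/c3"]

def ffms(rate, n_paths):
    """Frequent structure minimum support (assumed to be rounded up)."""
    return math.ceil(rate * n_paths)

def frequent_prefixes(paths, min_support):
    """Count every path prefix and keep those that reach min_support."""
    counts = Counter()
    for p in paths:
        parts = p.split("/")
        for k in range(1, len(parts) + 1):
            counts["/".join(parts[:k])] += 1
    return {prefix: n for prefix, n in counts.items() if n >= min_support}

def maximal(frequent):
    """Frequent prefixes that are not a proper prefix of another frequent one."""
    return [p for p in frequent
            if not any(q.startswith(p + "/") for q in frequent)]

support = ffms(0.2, len(paths))
frequent = frequent_prefixes(paths, support)
print(support, frequent, maximal(frequent))
```

Under these assumptions the sketch yields the support 2, the counts a: 6, a/b: 6, a/b/c2: 4, and the single maximal frequent structure a/b/c2.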
4 Large Item Based Document Cluster

The frequent structures of each XML document are the basic data for clustering. We
treat each XML document as a transaction and the frequent structures extracted from
each document as the items of the transaction, and then perform document clustering
using the notion of large items.



4.1 A New Clustering Criterion 



The set of items included in all the transactions is defined as I = {i_1, i_2, ..., i_n},
the cluster set as C = {C_1, C_2, ..., C_m}, and the transaction set that represents the
documents as T = {t_1, t_2, ..., t_k}. As a criterion for allocating a transaction to the
appropriate cluster, we define the cluster allocation gain.

Definition 2 (Cluster Allocation Gain). The cluster allocation gain is the sum, over
every cluster, of the ratio of the total occurrences to the individual items, weighted
by the number of transactions in the cluster. The following equation expresses this.

$$ Gain(C) = \frac{\sum_{i=1}^{m} G(C_i) \times |C_i(Tr)|}{\sum_{i=1}^{m} |C_i(Tr)|} = \frac{\sum_{i=1}^{m} \frac{T(C_i)}{W(C_i)^2} \times |C_i(Tr)|}{\sum_{i=1}^{m} |C_i(Tr)|} \qquad (1) $$

where G is the ratio of the occurrence rate (H) to the number of individual items (W)
in a cluster: H = T (the total occurrences of the individual items) / W (the number of
individual items), so G = H/W = T/W^2, and |C_i(Tr)| is the number of transactions in
cluster C_i.

Gain is a criterion function for allocating a transaction to a cluster: the higher the
rate of common items, the larger the cluster allocation gain. Therefore we allocate a
transaction to the cluster that yields the largest Gain.

However, if we use only the rate of the common items without considering the individual
items, as CLOPE does, the following problem arises.



Example 1. Assume that transaction t4 = {f, c} is to be inserted when the clusters
C1 = {a:3, b:3, c:1} and C2 = {d:3, e:1, c:3} each contain three transactions. If t4 is
allocated to C1 or C2, then Gain is

$$ \frac{\frac{9}{4^2} \times 4 + \frac{7}{3^2} \times 3}{7} = 0.654. $$

On the other hand, if t4 is allocated to a new cluster, then Gain is

$$ \frac{\frac{7}{3^2} \times 3 + \frac{7}{3^2} \times 3 + \frac{2}{2^2} \times 1}{7} = 0.738. $$

Thus, t4 is allocated to a new cluster by Definition 2. As this example shows, a new
cluster obtains a considerably higher allocation gain, because G for a new cluster
equals W/W^2 (T = W when the cluster holds a single transaction). As a result, an
excessive number of clusters is produced, which may reduce cluster cohesion. In order
to address this problem, we define the large items and the cluster participation as
follows.
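The two Gain values of Example 1 can be reproduced with a few lines. This is a minimal sketch of equation (1); representing a cluster as a pair of item counts and transaction count, and the helper names `gain` and `add`, are assumptions made for illustration.

```python
def gain(clusters):
    """Cluster allocation gain of Definition 2.

    Each cluster is a pair (item_counts, n_transactions), where item_counts
    maps an item to its number of occurrences in the cluster.
    """
    weighted = 0.0
    total_tr = 0
    for item_counts, n_tr in clusters:
        t = sum(item_counts.values())   # T: total occurrences of the items
        w = len(item_counts)            # W: number of individual items
        weighted += t / w ** 2 * n_tr
        total_tr += n_tr
    return weighted / total_tr

def add(cluster, transaction):
    """Return a copy of the cluster with one more transaction."""
    counts, n_tr = cluster
    counts = dict(counts)
    for item in transaction:
        counts[item] = counts.get(item, 0) + 1
    return counts, n_tr + 1

c1 = ({"a": 3, "b": 3, "c": 1}, 3)
c2 = ({"d": 3, "e": 1, "c": 3}, 3)
t4 = {"f", "c"}

gain_in_c1 = gain([add(c1, t4), c2])
gain_in_c2 = gain([c1, add(c2, t4)])
gain_new = gain([c1, c2, add(({}, 0), t4)])
print(gain_in_c1, gain_in_c2, gain_new)
```

Both existing clusters give the same value, and the new cluster gives the larger one, which is the behavior the example criticizes.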



Definition 3 (Large Items). Given the minimum support θ (0 < θ <= 1) determined by the
user, the item support of cluster C_i is defined as Sup = θ * |C_i(Tr)|. If the number
of transactions in cluster C_i that include item i_j (j <= n) is at least the item
support, the item i_j is a large item in the cluster C_i.

$$ C_i(L)_{i_j} = |C_i(Tr)_{i_j \in I}| \ge Sup $$

where |C_i(Tr)| is the number of all transactions in C_i, and |C_i(Tr)_{i_j \in I}| is
the number of transactions including the item i_j in the cluster C_i.

Definition 4 (Cluster Participation). The cluster participation is the ratio of the
items that transaction t_k, composed of frequent structures, has in common with the
large items of cluster C_j to the number of items of t_k. It expresses the probability
that transaction t_k is assigned to cluster C_j. We represent it as follows.

$$ P\_Allo(t_k \Rightarrow C_j) = \frac{|t_k \cap C_j(L)|}{|t_k|} \ge \omega_1 \qquad (2) $$

(0 < ω_1 < 1: minimum participation)

|t_k| is the number of items of the transaction t_k. In Example 1, if any cluster
satisfies the given minimum participation when t4 is inserted, our approach does not
produce a new cluster but allocates t4 to the cluster with the maximum participation.
Therefore, cluster participation can control the number of clusters. When ω_1 is small,
the production of new clusters is suppressed.
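Definitions 3 and 4 can be illustrated as follows. The transactions inside C1 and C2 are hypothetical; they are merely chosen so that the item counts agree with Example 1. The value θ = 0.6 and the function names are likewise assumptions.

```python
from collections import Counter

def large_items(transactions, theta):
    """Items contained in at least theta * |C_i(Tr)| transactions (Definition 3)."""
    support = theta * len(transactions)
    counts = Counter(item for t in transactions for item in set(t))
    return {item for item, n in counts.items() if n >= support}

def participation(transaction, large):
    """P_Allo: share of the transaction's items that are large items (Definition 4)."""
    items = set(transaction)
    return len(items & large) / len(items)

# Hypothetical transactions giving C1 = {a:3, b:3, c:1} and C2 = {d:3, e:1, c:3}.
c1 = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}]
c2 = [{"d", "c"}, {"d", "c"}, {"d", "e", "c"}]
t4 = {"f", "c"}

large_c1 = large_items(c1, theta=0.6)
large_c2 = large_items(c2, theta=0.6)
p1 = participation(t4, large_c1)
p2 = participation(t4, large_c2)
print(large_c1, large_c2, p1, p2)
```

With these hypothetical contents, t4 shares no large item with C1 but half of its items with C2, so a minimum participation of 0.5 would keep t4 in C2 instead of opening a new cluster.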

Definition 5 (Cluster Cohesion). The cluster cohesion Coh(C_i) is the ratio of the
large items to all items T(C_i) in the cluster C_i. It is calculated by the following
formula; the closer it is to 1, the better the quality of the cluster.

$$ Coh(C_i) = \frac{C_i(L)}{T(C_i)} \qquad (3) $$

Definition 6 (Inter-cluster Similarity). The inter-cluster similarity based on the
large items is the rate of the common large items of the clusters C_i and C_j. We
calculate it by the following formula; the closer it is to 0, the better the clustering.

$$ Sim(C_i, C_j) = \frac{L(C_i \cap C_j) \times \frac{|L(C_i \cap C_j)|}{|L(C_i + C_j)|}}{C_i(L) + C_j(L)} \qquad (4) $$

where L(C_i ∩ C_j) is the number of common large items in the clusters C_i and C_j,
|L(C_i ∩ C_j)| is the total number of occurrences of the common large items, and
|L(C_i + C_j)| is the total number of occurrences of the large items of the clusters
C_i and C_j.

4.2 Cluster Allocation Using Difference Operation 

When a new document is inserted, we first extract the representative structures of the
document by the method described in Section 3. The document can then easily be
allocated to the cluster with the largest Gain by computing the difference operation on
the current Gain, as follows.

Definition 7 (Difference Operation). The difference operation is the change in Gain
caused by inserting a transaction into an existing cluster. We call it the inserted
difference, diff_Gain(Δ+).

$$ diff\_Gain(\Delta^{+}) = New\_Gain(C_i) - Old\_Gain(C_i) = \frac{T'(C_i)}{W'(C_i)^2} \times (|C_i(Tr)| + 1) - \frac{T(C_i)}{W(C_i)^2} \times |C_i(Tr)| \qquad (5) $$

W' is the number of individual items and T' is the total number of occurrences of the
individual items after the transaction is inserted. We can compute the change in the
current Gain by (5).
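Equation (5) only touches the cluster that receives the transaction, which the following sketch makes explicit. The function name and the convention that an empty (new) cluster has an old value of 0 are assumptions; the transaction is assumed to be non-empty.

```python
def diff_gain(item_counts, n_tr, transaction):
    """Equation (5): change of T/W^2 * |C_i(Tr)| when a transaction is inserted.

    item_counts maps each item of the cluster to its number of occurrences,
    n_tr is the number of transactions currently in the cluster.
    """
    t_old = sum(item_counts.values())
    w_old = len(item_counts)
    old = t_old / w_old ** 2 * n_tr if w_old else 0.0  # assumed 0 for a new cluster

    items = set(transaction)
    t_new = t_old + len(items)                  # T'
    w_new = len(set(item_counts) | items)       # W'
    new = t_new / w_new ** 2 * (n_tr + 1)
    return new - old

c1 = {"a": 3, "b": 3, "c": 1}
t4 = {"f", "c"}
d_existing = diff_gain(c1, 3, t4)   # 9/16 * 4 - 7/9 * 3
d_new = diff_gain({}, 0, t4)        # 2/4 * 1 - 0
print(d_existing, d_new)
```

For the data of Example 1 the difference is slightly negative for C1 and positive for a new cluster, consistent with the Gain values computed there.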

Even with the difference operation, it takes too much time to compute the Gain for all
the clusters. Therefore, we reuse the cluster participation (Definition 4), which was
introduced to control the production of new clusters, to predict the allocation
probability quickly, and we compute diff_Gain only for the clusters that satisfy a
second cluster participation threshold ω_2. Although a small ω_2 yields better
clustering, clustering then takes more time because the number of clusters for which
diff_Gain must be computed increases.

We allocate the transaction to the cluster that has the largest diff_Gain. The XML
document clustering algorithm based on the difference operation is shown in Figure 4.



Insert transaction t

while not end of the existing clusters and p_Allo(C) >= ω_2     // (ω_2 = 0.2)
    find a cluster (Ci) maximizing diff_Gain(C);
    find a cluster (Cj) maximizing p_Allo(C);

if diff_Gain(Ck) > diff_Gain(Ci)            // new cluster Ck
    if p_Allo(Cj) >= ω_1                    // (ω_1 = 0.5), an existing cluster Cj
        allocate t to an existing cluster Cj;
    else allocate t to a new cluster Ck;
else allocate t to an existing cluster Ci;



Fig. 4. Similar structure-based XML document clustering algorithm using the difference
operation
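One possible reading of Figure 4 is sketched below. It is an interpretation, not the authors' implementation: clusters are lists of transactions (sets of items), the minimum support θ = 0.5 is an assumed value, a transaction with no candidate cluster is assumed to open a new cluster, and all function names are hypothetical. The thresholds ω_1 = 0.5 and ω_2 = 0.2 are the values given in the figure.

```python
from collections import Counter

def large_items(cluster, theta):
    counts = Counter(item for t in cluster for item in t)
    return {i for i, n in counts.items() if n >= theta * len(cluster)}

def p_allo(t, cluster, theta):
    return len(t & large_items(cluster, theta)) / len(t)

def diff_gain(t, cluster):
    counts = Counter(item for tr in cluster for item in tr)
    old = sum(counts.values()) / len(counts) ** 2 * len(cluster) if cluster else 0.0
    counts.update(t)
    new = sum(counts.values()) / len(counts) ** 2 * (len(cluster) + 1)
    return new - old

def allocate(t, clusters, theta=0.5, omega1=0.5, omega2=0.2):
    """Insert transaction t (a set of items); return the index of its cluster."""
    # Only clusters reaching the looser threshold omega2 are evaluated.
    candidates = [i for i, c in enumerate(clusters) if p_allo(t, c, theta) >= omega2]
    target = len(clusters)                       # index of a would-be new cluster
    if candidates:
        ci = max(candidates, key=lambda i: diff_gain(t, clusters[i]))
        cj = max(candidates, key=lambda i: p_allo(t, clusters[i], theta))
        if diff_gain(t, []) > diff_gain(t, clusters[ci]):
            # A new cluster would score higher; keep t in Cj only if it
            # participates strongly enough.
            if p_allo(t, clusters[cj], theta) >= omega1:
                target = cj
        else:
            target = ci
    if target == len(clusters):
        clusters.append([])
    clusters[target].append(t)
    return target

clusters = [
    [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}],
    [{"d", "c"}, {"d", "c"}, {"d", "e", "c"}],
]
first = allocate({"f", "c"}, clusters)
second = allocate({"x", "y"}, clusters)
print(first, second, len(clusters))
```

In this run the transaction {f, c} stays in the second existing cluster because its participation reaches ω_1, whereas {x, y} shares no large item with any cluster and opens a new one.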



5 XML Retrieval Based on Clustering 

The clustering of XML documents based on similar structure can be used as a
preprocessing step for structural retrieval of XML documents, because it restricts the
search space to a cluster of similar structure. Therefore we can obtain the retrieval
result quickly.

XML retrieval based on the clustering consists of three basic steps:

1. Simplify the query into a simple structure by using the element mapping table.

2. Reduce the search space to the cluster of similar structure by finding the most
similar cluster, comparing the structure of the large items in each cluster with the
query.

3. Display the ranked XML documents by computing the similarity between the query and
the documents in the similar cluster.

To retrieve XML documents of similar structure, it is necessary to compute a
similarity. An XML document is composed of elements with a hierarchical structure and
is represented by edges of parent-child relations in a tree structure. Therefore, in
order to compute the structural similarity between a query tree and XML documents, we
consider both the edges and the paths from the root node to the node currently being
considered.






Here, a path expresses the structural feature of elements in an XML document. For this
purpose, we formulate two measures, edge similarity and path similarity, as follows.

Definition 8 (Edge Similarity). Given an ordered labeled tree T and a query tree Q, the
edge similarity is defined as the ratio of the number of common edges to the total
number of edges of T and Q. If there are edges u→v and u'→v' in T and Q respectively,
and u = u' and v = v', we say that u→v and u'→v' are matched. The edge similarity is
obtained by the following formula.

$$ EdgeSim(Q, T) = \frac{|E_Q \cap E_T|}{|E_Q \cup E_T|} $$

Example 2. Figure 5 denotes two XML document trees to show how to compute edge 
similarity. 




We can obtain the following edge sets for trees T1 and T2 (each element name is
replaced by its first letter).

E_T1 = { …, t→a, j→d, a→f, a→l }

E_T2 = { …, j→a, j→d, a→f, a→l }

So the edge similarity between T1 and T2 by Definition 8 is computed as follows.

$$ EdgeSim(T1, T2) = \frac{|E_{T1} \cap E_{T2}|}{|E_{T1} \cup E_{T2}|} = \frac{6}{8} $$
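Edge similarity is a Jaccard ratio over sets of parent-child edges, as the following sketch shows. The two small trees are hypothetical and are not the trees of Figure 5; the function name is an assumption.

```python
def edge_similarity(edges_q, edges_t):
    """|E_Q ∩ E_T| / |E_Q ∪ E_T| for two sets of (parent, child) edges."""
    q, t = set(edges_q), set(edges_t)
    return len(q & t) / len(q | t)

# Hypothetical trees: two edges are shared, the union holds four edges.
t1 = {("book", "title"), ("book", "author"), ("author", "name")}
t2 = {("book", "title"), ("book", "author"), ("author", "email")}

value = edge_similarity(t1, t2)
print(value)
```

Two edges match only when both the parent label and the child label agree, which corresponds to the matching condition of Definition 8.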

Definition 9 (Path Similarity). Given an ordered labeled tree T and a query tree Q, a
path is a sequence of consecutive edges from the root node to a particular node at some
depth (i.e., v_1→v_2, v_1→v_2→v_3, ..., v_1→v_2→...→v_n), and the path similarity is
defined by the ratio of the number of common paths to the total number of paths between
T and Q. If there are paths v_1→v_2→v_3 and v_1'→v_2'→v_3' in T and Q respectively, and
v_1 = v_1', v_2 = v_2' and v_3 = v_3', we say that v_1→v_2→v_3 and v_1'→v_2'→v_3' are
matched. The path similarity is obtained by the following formula.

$$ PathSim(Q, T) = \frac{Max_{com}|P_Q \cap P_T|}{Max_{path}|P_Q, P_T|} $$

where Max_path|P_Q, P_T| is the largest path length among the paths in T and Q, and
Max_com|P_Q ∩ P_T| is the largest common path length in T and Q.
For trees T1 and T2 in Figure 5, the path similarity is obtained from their path sets
as follows.

$$ PathSim(T1, T2) = \frac{Max_{com}|P_{T1} \cap P_{T2}|}{Max_{path}|P_{T1}, P_{T2}|} = \frac{3}{4} $$
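The formula of Definition 9 can be sketched as follows. We assume that the length of a path is its number of nodes and that node labels are unique within a tree; both assumptions, the hypothetical example trees, and the function names are not taken from the paper.

```python
def root_paths(edges, root):
    """All root-to-node paths (tuples of labels) that contain at least one edge."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    paths, stack = set(), [(root,)]
    while stack:
        path = stack.pop()
        for child in children.get(path[-1], []):
            new_path = path + (child,)
            paths.add(new_path)
            stack.append(new_path)
    return paths

def path_similarity(paths_q, paths_t):
    """Largest common path length divided by the largest path length."""
    longest_common = max((len(p) for p in paths_q & paths_t), default=0)
    longest = max(len(p) for p in paths_q | paths_t)
    return longest_common / longest

# Hypothetical trees: T1 is one level deeper than T2.
t1 = {("book", "title"), ("book", "author"), ("author", "name"), ("name", "first")}
t2 = {("book", "title"), ("book", "author"), ("author", "name")}

value = path_similarity(root_paths(t1, "book"), root_paths(t2, "book"))
print(value)
```

Here the longest common path book→author→name has three nodes and the longest path overall has four, so the ratio has the same form as the 3/4 of the example above.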




According to the above definitions, the similarity between Q and T that considers both
edge similarity and path similarity is defined by equation (6), where α > β gives more
emphasis to the edge similarity and α < β gives more emphasis to the path similarity;
by default, α = 0.5 and β = 0.5.

Therefore, the similarity of trees T1 and T2 in Figure 5 by equation (6) is
0.5 * 6/8 + 0.5 * 3/4 = 0.75.
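Equation (6) is a weighted sum, Sim = α * EdgeSim + β * PathSim with α + β = 1, and ranking documents amounts to sorting by this value. The sketch below recomputes the value for T1 and T2 and ranks three hypothetical documents; the document identifiers, their similarity pairs, and the function name are assumptions.

```python
def sim(edge_sim, path_sim, alpha=0.5, beta=0.5):
    """Equation (6): weighted sum of edge similarity and path similarity."""
    if alpha <= 0 or beta <= 0 or abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must be positive and sum to 1")
    return alpha * edge_sim + beta * path_sim

# Value for trees T1 and T2 of Figure 5 (EdgeSim = 6/8, PathSim = 3/4).
value = sim(6 / 8, 3 / 4)

# Hypothetical (EdgeSim, PathSim) pairs of three documents against one query.
docs = {"d1": (0.5, 0.75), "d2": (1.0, 1.0), "d3": (0.25, 0.5)}
ranked = sorted(docs, key=lambda d: sim(*docs[d]), reverse=True)
print(value, ranked)
```

Choosing α larger than β would move documents with many matching edges up the ranking, while the opposite choice favors documents that share long root paths with the query.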

With a similarity measure that considers both edge similarity and path similarity, we
can estimate the degree of structural matching among trees; the closer the value is to
1, the more similar the trees are.

To detect the most similar cluster, we compare the representative structures of the
clusters with the given query tree Q, computing the similarity between the structures
of the large items in each cluster and Q by equation (6). After that, we calculate the
similarity between the XML documents in the detected cluster and Q in the same manner,
and the search result is displayed to the user as a list of XML documents ranked by
their similarity to the query.



6 Experiments and Implementation 

In this section, we evaluate the efficiency of the proposed clustering method and show 
the result of XML retrieval based on the method. 



6.1 The Clustering Experiments 

For our performance analysis, we conducted experiments comparing our method, XML_C,
with CLOPE. The data used were 400 XML documents in total, selected from 8 topics
(i.e., book, club, play, auction, company, department, actor, movies) and taken from
Wisconsin’s XML data bank [18]. We first extracted the representative structures of
each document with a frequent structure rate of 20% over the whole document. The
average length and the average number of the frequent structures extracted from a
document are 5.4 and 4.9, respectively.

We then performed clustering, including without redundancy the frequent structures
whose length exceeds 80% of the maximal frequent structure length.

The comparison of execution time according to the number of documents is shown in
Figure 6.

In Figure 6, we can see that CLOPE takes more time than XML_C on average. Examining
Figure 6 more closely, we notice that XML_C takes slightly more time in the first stage
of the experiment, because it requires additional time to construct the large items of
each cluster. However, as the number of documents increases, the cluster participation
based on the notion of large items takes effect and XML_C outperforms CLOPE.



$$ Sim(Q, T) = \alpha \times EdgeSim(Q, T) + \beta \times PathSim(Q, T) \qquad (\alpha + \beta = 1,\ \alpha > 0,\ \beta > 0) \qquad (6) $$
Fig. 6. Execution time

We also measured how many documents are included in clusters, excluding outliers; this
is related to the number of clusters produced. The result of this experiment is shown
in Figure 7.

The number of clusters produced by CLOPE is on average 1.3 times larger than that of
XML_C. This means that CLOPE produces an excessive number of clusters, each containing
fewer documents than those of XML_C, because it considers only the rate of common items
in the cluster, whereas we consider both the rate of common items and the individual
items using the notion of large items.




Fig. 7. The document rate of clustering 

We also measured the cluster cohesion and the inter-cluster similarity for all 400
documents. In order to compare with CLOPE, we ran CLOPE and then extracted the large
items from its clustering results based on the support. The cluster cohesion and the
inter-cluster similarity according to the minimum support are shown together in
Figure 8.




Fig. 8. Cluster cohesion and inter-cluster similarity 
As Figure 8 shows, the cohesion of XML_C is higher than that of CLOPE, and its
inter-cluster similarity is lower than that of CLOPE. These experimental results
therefore also show that XML_C produces better-quality clusters than CLOPE. This is
because our method considers not only the rate of common items but also the rate of
individual items in a cluster.



6.2 Implementation 

To retrieve XML documents, we implemented a user interface in which the user can enter
at most three elements of an ordered hierarchical structure in the left part of the
window. Figure 9 shows, in the right part of the window, the search result for the
query ‘book/title’: the XML documents of similar structure, ranked by similarity. The
contents of the XML document selected by the user are displayed in the lower right
corner of the window.




Fig. 9. A query and search result for retrieving XML documents 

In summary, clustering XML documents based on similar structure makes searching XML
structures fast, and it is also effective for classifying XML documents by structural
pattern in a large XML database.

7 Conclusions 

In this paper we proposed a new XML document clustering method based on similar
structure that is quite different from existing methods. We first extracted the
representative structures of XML documents using sequential pattern mining, focusing on
element paths that include the hierarchical element information in the XML document.
We then performed clustering based on similar structure using the notion of large items
to improve cluster quality and performance, treating an XML document as a transaction
and the frequent structures extracted from the documents as the items of the
transaction. Our experiments showed that our approach achieves higher cluster cohesion
and lower inter-cluster similarity while taking less time than CLOPE. We also showed
the effectiveness of our approach by applying it to XML retrieval, which is performed
by computing a similarity that considers edge similarity and path similarity between
XML documents and the query tree. Therefore, our method is efficiently applicable to
XML document management and classification in large XML document collections.
Abstract. Collaborative work requires, more than ever, access to data
located on multiple autonomous and heterogeneous data sources. The 
development of these novel information platforms, referred to as infor- 
mation or data grids, and the evolving databases based on P2P concepts, 
need appropriate modeling and description mechanisms. In this paper we 
propose the Link Pattern Catalog as a modeling guideline for recurring 
problems appearing during the design or description of information grids 
and P2P networks. For this purpose we introduce the Data Link Model- 
ing Language, a language for describing and modeling virtually any kind 
of data flows in information sharing environments. 



1 Introduction 

With the rise of filesharing systems like Napster or Gnutella, the database community
started to seriously adapt the idea of P2P systems to what were formerly known as
loosely coupled databases. While the original systems were only designed to share
simple files among a huge number of peers, we are not restricted to these data
sources any more. New developments allow peers to share virtually any data, no
matter whether it originates from a relational, object-oriented, or XML database. In
fact, the data may still come from ordinary flat files.

Apparently we have to deal with a very heterogeneous environment of data 
sources sharing data, referred to as an information or data grid [4]. If we al- 
low participants to join or leave information grids at any time (e.g. using P2P 
concepts [3]), we must take a constantly changing constellation of peers into 
account. Any information grid built up by these peers can either evolve dynam- 
ically or be planned beforehand. In both cases we need a concept in order to 
describe and understand the interactions among the peers involved. Having such 
a mechanism, we could not only detect single data exchanges, but even model 
and optimize complex data flows of the entire system. 

In this paper we adopt commonly used methods for designing data exchanges 
among peers as Link Patterns, suitable especially for information grids and P2P 
networks. Analogous to the intention of the Design Pattern Catalog used for 
object-oriented software development [8] we want to provide modeling guidelines 
for engineers and database designers, engaged in understanding, remodeling, or 
building up an information grid. Thus information grid architects are provided
with a common vocabulary for design and communication purposes. 

Up to now, data flows in information grids were designed without a formal
background, leading to individual solutions for specific problems. These were only
known to the circle of developers involved in that project. Other designers engaged
with a similar problem would never come into contact with these results and would
thus make the same mistakes again. Different modeling techniques make it difficult
to exchange successfully implemented solutions.

Link Patterns do not claim to introduce novel techniques for sharing, access- 
ing, or processing data in shared environments, but a framework for being able 
to understand, describe, and model their data flows. They provide a description 
of basic interactions between data sources and operations on the data exchanged, 
resulting in a catalog of reusable conceptual units. 

A developer may choose Link Patterns to model and describe complex data 
flows, to identify a single point of failure, or to avoid or consciously insert redun- 
dant data exchanges. The composition of Link Patterns is an essential feature of 
our design method. It gives us the possibility to represent a structured visual- 
ization not only of single data linkages, but of the entire information platform. 

The remainder of this paper is organized as follows. In section 2 we introduce
DLML, a language for modeling data flows, followed by a structural description
of the Link Patterns in section 3. Section 4 specifies the Link Pattern Catalog,
followed by an example. Section 6 reviews some related work and section 7
concludes.

2 The Data Link Modeling Language (DLML) 

2.1 Introduction 

The Data Link Modeling Language (DLML) is based on the Unified Modeling 
Language (UML) [8] notation, but slightly modifies existing components, adds 
additional elements, and thus extends its functionality. It is a language for mod- 
eling, visualizing, and optimizing virtually any kind of data flows in information 
sharing environments. 

Modeling: DLML is a language, suitable for modeling, planning, and re-en- 
gineering data flows in information sharing environments, e.g. information 
grids, systematically. A Data Link Model built up using this language reflects 
the logical and not the physical structure of the entire system. It enables the 
developer to specify the properties and the behavior of existing and novel 
systems, in order to describe and understand their basic functionalities. 
Visualizing: Visualizing data flows is an important aid to understanding the
structure and behavior of an information platform. The impact of ER [13] and
UML has proven that a system is easier to grasp and less error-prone if a
graphical visualization technique is provided that uses a well-defined set of
graphical symbols understood by a broad community. Especially in the analysis
of systems with distributed information, it is favorable to have a method
suitable for drawing up a map of relationships between the participating
peers, in order to depict global data flows.
Optimizing: Besides the modeling and visualization of an information sharing
environment, DLML can be useful to optimize the whole distributed data 
management. Redundant data flows and data stocks can systematically be 
detected and removed, leading to a higher performance of the entire system. 
Of course, redundancy may explicitly be wanted, in order to achieve a higher 
fail-safety or a faster access to the data. 

Due to the characteristics mentioned above, the Data Link Modeling Lan- 
guage is especially suitable for visualizing data flows in distributed information 
grids. It may furthermore be employed to model data management in enter- 
prise information systems, data integration and migration scenarios, or data 
warehouses, i.e. wherever data has to be accessed across multiple different data 
sources. 

2.2 Components 

Since DLML is based on UML, its diagrams are constructed in an analogous 
manner, using a well-defined set of building blocks according to specific rules. 
The following components may be used in DLML (Fig. 1) to build up a Data 
Link Model: 



Fig. 1. DLML Components (Data Node, labeled NodeName:DataStockName; Data Node with
Role; Application Node; Comment)



Nodes: Nodes are data sources, data targets, or applications, usually involved
in a data exchange process. They may either be isolated or connected through
at least one data flow. A data source may be a database (e.g. relational), a
flat file (e.g. XML), or something similar, offering data, whereas a data target
receives data and stores it locally. An application is a software unit which
accesses or generates data without maintaining its own physical data stock.
Physical data stocks are represented in DLML by Data Nodes, applications
by Application Nodes.

Label: Each node can have a label. It generally consists of two parts separated
by a colon: the node name and the data stock name or application name,
respectively. The data stock name identifies the combination of data and
schema information stored at this node. If this data is replicated as an exact
and complete copy to another node, the data target has to use the same data
stock name. The application is identified by the application name. Analogous
to the data stock name, any further instances of the same application have
the same application name. In both cases we use the node name to distinguish
nodes with the same data stock or application name. Otherwise the node
name is optional.

Location: The optional location tagged value specifies the physical location of 
the node. It either specifies an IP address, a server name, or a room number, 
helping the developer to locate the Data or Application Node. 

Role: A node providing a certain functionality on the data processed, may have 
a functional role (e.g. filtering or integrating data). This role will usually be 
implemented as a kind of application, operating directly on the incoming or 
outgoing data. The name of the role or its abbreviation is placed directly 
inside the symbol of the node. This information is not only useful for increas- 
ing the readability of the model, but also for being able to identify complex 
relationships. 

Data Flow: The data exchange between exactly one data source and one data
target is called a data flow. The arrow symbolizes the direction in which data
is being sent. A node may have multiple incoming and outgoing data flows.
Optionally, each data flow may be labeled according to its behavior, i.e. whether
the data is being replicated («copy») to the data target or just accessed
(«access»). If data is being synchronized, both data flow arrows may be
replaced by one single arrow with two arrowheads.

Comment: A comment may be attached to a component, in order to provide 
additional information about a node or a data flow. These explanations may 
concern a node’s role, filter criteria, implementation hints, data flow proper- 
ties, or further annotations important for the comprehension of the model. 



2.3 Example 

We now illustrate the usage of the Data Link Modeling Language with a simplified
example. Consider a worldwide operating wholesaler with an autonomous
overseas branch. The headquarters is responsible for maintaining the product
catalog (hq:products) with its price list, while the customers database
(:customers) is administered by the branch itself (Fig. 2).

The overseas branch is connected to the headquarters by a dial-up connection,
which is not sufficient for accessing the database permanently. For this reason,
the product catalog is replicated to the branch twice a day (branch:products),
where the data may be accessed by the local employees. The branch management
uses a special application (:managementApp) to access both data stocks in
order to generate the annual report for the headquarters.
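To make the structure of this example concrete, the Data Link Model can be written down as plain data. This is only an illustrative encoding, not part of DLML: the class names, the field names, and the choice of «copy» and «access» labels for the three data flows are assumptions derived from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    stock: str            # data stock name or application name
    name: str = ""        # node name, optional unless needed to tell nodes apart
    kind: str = "data"    # "data" (Data Node) or "application" (Application Node)

    @property
    def label(self) -> str:
        return f"{self.name}:{self.stock}"

@dataclass(frozen=True)
class Flow:
    source: Node
    target: Node
    behavior: str         # "copy" (replicated) or "access" (read in place)

hq_products = Node("products", "hq")
branch_products = Node("products", "branch")
customers = Node("customers")
management_app = Node("managementApp", kind="application")

flows = [
    Flow(hq_products, branch_products, "copy"),
    Flow(branch_products, management_app, "access"),
    Flow(customers, management_app, "access"),
]

def replicas(flows):
    """Group Data Nodes by data stock name to reveal replicated data stocks."""
    groups = {}
    for flow in flows:
        for node in (flow.source, flow.target):
            if node.kind == "data":
                groups.setdefault(node.stock, set()).add(node.label)
    return {stock: labels for stock, labels in groups.items() if len(labels) > 1}

print(replicas(flows))
```

Because replicated data stocks share a data stock name, a simple grouping by that name already exposes the redundant copy of the product catalog, which is the kind of analysis the Optimizing aspect of DLML aims at.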

3 Link Patterns 

In order to be able to provide a catalog of essential Link Patterns, it is necessary
to understand what a Link Pattern is. Therefore we present the elements a Link
Pattern is composed of, including its name, its classification, and its description.
For graphical representation we use the Data Link Modeling Language, specified
above.

Fig. 2. Node labels: hq:products {location = hq.myserver.com},
branch:products {location = Server A}, :managementApp {location = New York}

3.1 Elements of a Link Pattern 

In this section we present the description of the Link Pattern structure. It is 
based on the Design Pattern Catalog of Gamma et al. [8], which has reached 
great acceptance within the software engineering community. Thus a developer 
is able to quickly understand and adopt the main concept of each Link Pattern 
for his own purposes. Each Link Pattern is described by the following elements: 

Name: The name of a Link Pattern is its unique identifier. It has to give a first 
hint on how the pattern should be used. The name is substantial for the 
communication between or within groups of developers. 

Classification: A Link Pattern is classified according to the categories de- 
scribed in section 3.2. The classification organizes existing and future pat- 
terns depending on their functionality. 

Motivation: Motivating the usage of the pattern is very important, since it
explains the basic functionality to the developer in a figurative way. This is
done using a small scenario, which illustrates a possible application field of
the pattern. With this, the developer is able to understand and follow the more
detailed descriptions in the further sections.

Graphical Representation: The most important part of the pattern descrip- 
tion is the graphical representation. It is a DLML diagram and describes the 
composition and intention of the pattern in an intuitive way. The developer 
is advised to adopt this representation, wherever he has identified the related 
functionality in his own information grid model. 

Description: The composition of the Link Pattern is described in depth in this
section, including every single component and its detailed functionality. The
explanation of the local operations on each node and of the data flows between
the components involved highlights the intended functionality of the whole
pattern described. This description shall give the user both guidance through
the identification process and instructions for its proper usage.

Challenges: Besides the general instructions given in the prior section, this
section shall give hints about sources of error in the implementation process of
this pattern. The developer shall get ideas of how to identify and avoid pitfalls
arising in a certain context (e.g. interaction with other Link Patterns).



3.2 Classification 

A classification of the Link Pattern Catalog shall provide an organized access 
to all Link Patterns presented. Patterns situated in the same class have similar 
structural or functional properties, depending on the complexity of their imple- 
mentation. Although a categorization of a very limited number of patterns may 
seem superfluous, we have decided to include this into our Link Pattern Cata- 
log, since it shall help developers to allocate and evaluate the pattern required. 
Furthermore it should stimulate the developer to find and rate novel patterns, 
not yet included in the catalog. 



Link Patterns
    Elementary
    Composed
        Data Sensitive
        Data Independent

Fig. 3. Link Pattern Catalog Classification



Figure 3 depicts the classification we have chosen for our Link Pattern Catalog.
The patterns presented can be divided into two main categories, Elementary
Link Patterns and Composed Link Patterns. This classification is not yet
complete, but shall provide a starting point for further extension.

Elementary Link Pattern: An Elementary Link Pattern is the smallest unit
for building up an information grid model. It consists of exactly one node
and at least one data flow connected to it. Each Data Link Model is
composed of several Elementary Link Patterns, linked together with data
flows in an appropriate way. Please note that a single Elementary Link
Pattern is not yet a reasonable Data Link Model, since any data flow must
have at least one node offering data and one node receiving data.
Elementary Link Patterns are easy to understand and easy to implement,
since they concern only a single node and a small set of data flows, and
include essentially no data processing logic. It must be pointed out that the
Elementary Link Patterns consist only of two main patterns, the Basic Data
Node and the Basic Application Node, and their derivatives (e.g. Publisher and
Generator, discussed in section 4).
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Composed Link Pattern: Composed Link Patterns are built up by combin- 
ing at least two Elementary Link patterns in a specific way, in order to 
realize a particular functionality. A Composed Link Pattern may hereby be 
composed out of both, Elementary or other Composed Link Patterns. A pat- 
tern has to represent a prototype or solution for a recurring sort of problem. 
Please keep in mind, that an arbitrary combination of different patterns will 
not automatically lead to a reasonable Composed Link Pattern. 

In contrast to the Elementary Link Patterns, we have to deal in this context 
with a more complex kind of patterns. They do not only include more nodes, 
but may even represent a quite sophisticated way of linking them. Besides, 
each node may additionally process the data received or sent. The fact, that 
it may act differently depending on the data involved, is an essential property 
of Composed Link Patterns and justifies the creation of two subclasses: 

Data Sensitive Link Pattern: As soon as a node included in a Composed
Link Pattern acts depending on the data it processes, the entire pattern
is called a Data Sensitive Link Pattern. The data processing logic
implemented on such a node may depend on, and be applied to, incoming
and/or outgoing data. Its operations can create, alter, or filter data.

Data Independent Link Pattern: Any Composed Link Pattern not classified
as Data Sensitive belongs to this class. In contrast to the patterns
described above, data is not modified, but sent or received as is. The
crucial aspect here is the topology of the nodes and data flows involved,
which is most relevant for the creation and functionality of this kind of
pattern.
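
The classification above can be read as a small decision procedure. The
following Python sketch is only an illustration under stated assumptions: the
names Node, LinkPattern and classify, and the data_sensitive flag, are
hypothetical and not part of DLML or the catalog.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    # Assumption: a flag marks nodes whose behaviour depends on the data processed.
    data_sensitive: bool = False


@dataclass
class LinkPattern:
    nodes: List[Node]
    # Each flow is a (source name, target name) pair; an endpoint may lie
    # outside the pattern, as is always the case for an Elementary Link Pattern.
    flows: List[Tuple[str, str]]


def classify(pattern: LinkPattern) -> str:
    """Return the catalog class of a pattern, following Section 3.2."""
    names = {n.name for n in pattern.nodes}
    if not names:
        raise ValueError("a Link Pattern needs at least one node")
    connected = [f for f in pattern.flows if f[0] in names or f[1] in names]
    if len(names) == 1:
        if not connected:
            raise ValueError("an Elementary Link Pattern needs at least one data flow")
        return "Elementary"
    if any(n.data_sensitive for n in pattern.nodes):
        return "Data Sensitive"
    return "Data Independent"


publisher = LinkPattern([Node("sales_db")], [("sales_db", "report_app")])
filtered = LinkPattern(
    [Node("source"), Node("filter", data_sensitive=True)],
    [("source", "filter"), ("filter", "sink")],
)
print(classify(publisher))  # Elementary
print(classify(filtered))   # Data Sensitive
```

The sketch checks only the structural criteria named in the text; whether a
combination of nodes is a reasonable pattern remains a modeling decision.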



3.3 Usage 

This section describes how Link Patterns can be used to develop, maintain,
analyze, or optimize both straightforward and complex data flows in
information grids. There are basically two ways in which Link Patterns can
support the work of developers:

Analyzing existing systems: Many existing information grids have grown
over the years without being planned centrally or consistently. Even if
they were planned initially, they usually tend to spread in an uncontrolled
way. In such an environment it is vital to have supporting tools that help
to understand and later optimize an existing system.

First of all, a map or model of the existing system has to be created, e.g.
with DLML presented in section 2. Afterwards we examine successively smaller
parts of the model, in order to match them to existing Link Patterns of the
Catalog. As a result we get a revised model containing basic information
on the composition and functionality of subsystems, including their data
processing and data flows. With this information in mind, we are able
to derive information on data flows and the interaction of nodes inside the
Data Link Model. This enables us to perform optimizations such as detecting
and eliminating vulnerabilities or handling redundancies.

Link Patterns thus cannot replace human expertise in understanding existing
information grids, but they support the process of recognizing global data
flows and thereby interpreting the purpose of the entire system.

Composing new models: As already mentioned, Link Patterns not only improve
the process of understanding an existing information grid, but also support
the modeling of new systems. An information grid architect needs to
have a clear idea of what the system should do. Depending on the data
sources available, the local requirements on the nodes, and the intended
results, the architect can combine nodes and data flows according to Link
Patterns until the entire system realizes the intended functionality. Link
Patterns thereby provide a common language that is also understood by
developers not yet involved in the modeling. Each developer is thus able to
quickly get a general idea of the modeled system at any time. Furthermore,
Link Patterns accelerate the development process, since they provide
well-tried solutions for recurring problems, which helps to obtain a
well-performing system of high quality.



4 Link Pattern Catalog 

In this section we introduce the Link Pattern Catalog. This includes a
graphical overview of the main Link Patterns in DLML, as well as a detailed
description of selected patterns. As mentioned before, the Link Patterns can
be classified according to the classification presented in section 3.2.
Since any Composed Link Pattern belongs either to the Data Sensitive or to
the Data Independent Link Patterns, we organize the catalog as follows:

Elementary Link Patterns 

The Elementary Link Patterns are the basic building blocks of a Data Link
Model. They consist of the two basic patterns described below and their
derivatives. All Elementary Link Patterns are depicted in Figure 4.

Basic Data Node 

Classification: Elementary Link Pattern 

Motivation: This pattern is one of the basic building blocks of a Data Link 
Model. Each incoming or outgoing data flow of a Data Node is modeled 
using this Link Pattern. 

Graphical Representation: See Figure 4 

Description: A Basic Data Node is a DLML Data Node which receives data
through incoming data flows, stores it locally, and simultaneously propagates
data held in its own data stock. If a Basic Data Node has only outgoing
or only incoming data flows, it applies the Publisher Pattern or the
Subscriber Pattern, respectively. If it has neither incoming nor outgoing
data flows, the Data Node is called isolated.

[Figure 4 shows six node symbols: Basic Data Node, Basic Application Node,
Subscriber, Publisher, Consumer, Generator.]

Fig. 4. Elementary Link Patterns



Challenges: One of the main challenges of this pattern is the proper
coordination of incoming and outgoing data flows. First, all incoming data
has to be stored permanently in the local data stock, without violating any
constraints, before it may be propagated to other nodes.

Basic Application Node 

Classification: Elementary Link Pattern 

Motivation: This pattern is one of the basic building blocks of a Data Link 
Model. All applications, relevant for a Data Link Model, are based on this 
pattern. 

Graphical Representation: See Figure 4 

Description: An application interacting with arbitrary Data or Application
Nodes is represented by this pattern. The application not only receives,
but also propagates data. If a Basic Application Node has only outgoing
or only incoming data flows, it applies the Generator Pattern or the
Consumer Pattern, respectively. If it has neither incoming nor outgoing
data flows, the Application Node is called isolated.

Challenges: Propagated data can either be received or generated. All
manipulations of incoming data that is to be propagated have to be
performed in real time, without storing data locally.
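
The naming rules for the derivatives of the two basic patterns can be
summarized in a few lines. The following Python sketch is illustrative only;
the function name elementary_role and its parameters are assumptions, not
DLML notation.

```python
def elementary_role(kind: str, incoming: int, outgoing: int) -> str:
    """Name the Elementary Link Pattern applied by a single node.

    kind     -- "data" or "application" (the two DLML node kinds)
    incoming -- number of incoming data flows
    outgoing -- number of outgoing data flows
    """
    if kind not in ("data", "application"):
        raise ValueError(f"unknown node kind: {kind!r}")
    if incoming < 0 or outgoing < 0:
        raise ValueError("flow counts must be non-negative")
    if incoming == 0 and outgoing == 0:
        return "isolated"
    if incoming == 0:  # only outgoing flows
        return "Publisher" if kind == "data" else "Generator"
    if outgoing == 0:  # only incoming flows
        return "Subscriber" if kind == "data" else "Consumer"
    return "Basic Data Node" if kind == "data" else "Basic Application Node"


print(elementary_role("data", incoming=0, outgoing=2))         # Publisher
print(elementary_role("application", incoming=3, outgoing=0))  # Consumer
```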




Fig. 5. Data Independent Link Patterns

Data Independent Link Patterns

The Data Independent Link Patterns belong to the Composed Link Patterns.
These patterns describe a functionality that depends only on their structure,
i.e. the way nodes and data flows are combined. A graphical overview of the
patterns in this class is given in Figure 5; the Data Backbone is described
below as an example.

Data Backbone 

Classification: Data Independent Link Pattern 

Motivation: A Data Backbone is used wherever a centralization of data
sharing or data access has to be realized. This is typically required if
data stocks are re-centralized, a central authority wants to keep track of
all data flows, or data exchanges have to be established among multiple data
stocks and applications.

Graphical Representation: See Figure 5 

Description: The Data Backbone Pattern consists of several nodes, linked
together in a specific way. A designated node, called Data Backbone, is
either data source or data target for all data flows in this pattern. All
nodes, including the Data Backbone itself, can be data stocks or
applications. Data is always propagated from data sources to the Data
Backbone, where it may be accessed or propagated once again to other target
nodes. Direct data flows between nodes other than the Data Backbone are
avoided.

Challenges: Since the Data Backbone is involved in all data flows, it has a
crucial position in this part of the information grid. Thus, a Data Backbone
node has to provide a high quality of service concerning disk space, network
connection, and processing performance. If the quality of service required
cannot be provided, the Data Backbone may easily become a bottleneck.
Furthermore, a breakdown of this node could lead to a collapse of the entire
data sharing infrastructure, which makes it a single point of failure.
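
The structural condition of the Data Backbone (every data flow has the
backbone as source or target) is easy to check mechanically. The following
Python sketch assumes that flows are given as (source, target) pairs; the
function names and the sample node names are hypothetical.

```python
from typing import Iterable, List, Tuple

Flow = Tuple[str, str]  # (source node, target node)


def is_data_backbone(flows: Iterable[Flow], backbone: str) -> bool:
    """True if `backbone` is source or target of every data flow."""
    flows = list(flows)
    return bool(flows) and all(backbone in flow for flow in flows)


def direct_flows(flows: Iterable[Flow], backbone: str) -> List[Flow]:
    """Flows that bypass the backbone and therefore violate the pattern."""
    return [flow for flow in flows if backbone not in flow]


flows = [("crm", "hub"), ("erp", "hub"), ("hub", "reporting")]
print(is_data_backbone(flows, "hub"))                 # True
print(direct_flows(flows + [("crm", "erp")], "hub"))  # [('crm', 'erp')]
```

Such a check covers only the topology; the quality-of-service and
availability concerns listed under Challenges are outside its scope.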

Data Sensitive Link Patterns 

Contrary to the Data Independent Link Patterns, the patterns described in this 
section are not only classified according to their structural properties, but partic- 
ularly because of their data processing functionality. A graphical representation 
of these Data Sensitive Link Patterns can be found in Figure 6, while a detailed 
description is only given for the Gatekeeper Pattern. 

Gatekeeper 

Classification: Data Sensitive Link Pattern 

Motivation: A Gatekeeper is used to control data flows according to specific
rules (e.g. Access Control Lists), stored separately from the data processed.
It is responsible for providing the target nodes with the accessible data
required. The application of this pattern is not limited to data security
matters. It may be applied to any node that has to supply different target
nodes with specific (e.g. manipulated or filtered) data flows.

Fig. 6. Data Sensitive Link Patterns



Graphical Representation: See Figure 6 

Description: A Gatekeeper is a designated node which distributes data
according to specific rules, possibly stored separately. Local or incoming
data of a Gatekeeper is accessed by target nodes. Before this access can be
admitted, the Gatekeeper has to check the permissions. Thus, depending on
the rules processed, not all data stored in the Gatekeeper, and not all data
requested by the target nodes, is necessarily transmitted.

Challenges: The rules and techniques used by the Gatekeeper to secure access
to the data have to be robust and safe. The Gatekeeper needs a mechanism to
identify and authenticate the source and target nodes (e.g. IP address,
public key, username and password, or identifiers [11]); the corresponding
information may be stored in a separate data stock. Due to its vital role in
the exchange process, this information has to be protected from unauthorized
access. The Gatekeeper must be able to rely on the correctness, authenticity,
and availability of the rules required.
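
The core behaviour of a Gatekeeper can be illustrated with a minimal Python
sketch. It assumes a very simple rule format (each target node is mapped to
the source nodes whose data it may read); the class, its methods, and the
sample node names are hypothetical, and authentication of nodes is omitted.

```python
from typing import Any, Dict, List, Set, Tuple


class Gatekeeper:
    """Distributes stored data according to rules kept apart from the data."""

    def __init__(self, rules: Dict[str, Set[str]]) -> None:
        # Assumed rule format: target node -> source nodes whose data it may read.
        self._rules = rules
        self._stock: List[Tuple[str, Any]] = []  # (source node, record)

    def receive(self, source: str, record: Any) -> None:
        """Store an incoming record together with its source."""
        self._stock.append((source, record))

    def request(self, target: str) -> List[Any]:
        """Return only the records the target is permitted to read."""
        allowed = self._rules.get(target, set())  # unknown targets get nothing
        return [record for source, record in self._stock if source in allowed]


gk = Gatekeeper({"auditor": {"a", "b"}, "partner": {"a"}})
gk.receive("a", "record-1")
gk.receive("b", "record-2")
print(gk.request("partner"))   # ['record-1']
print(gk.request("stranger"))  # []
```

Denying access by default to targets without a rule is a design choice of
this sketch, not a requirement stated by the pattern.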



5 Example 

This section provides an example of how to model a new information grid of a
company operating worldwide. The headquarters of the company are located in
New York. It additionally has branches in Düsseldorf (head office of the
European branches), Paris, Bangalore, and Hong Kong. Each branch maintains
its own database containing sales figures, collected by local applications.
For backup and subsequent data analysis, this data has to be replicated to
the headquarters. Additionally, the Düsseldorf branch needs to be informed
about the ongoing sales activities of the Paris branch. To simplify the
centralized backup, the company has decided to forbid any data exchanges
between the individual branches.

The central component of this infrastructure is the backup system in New
York. It collects the sales data from all branches, without integrating them.
Additionally, it provides the Düsseldorf branch with all the information
required from Paris. Since the headquarters in New York want to analyze the
entire data stock of the company, a data warehouse, based on the data of the
backup system, is set up. Since the branches have a certain local autonomy,
the data provided by the European branches and by the remaining branches
show some structural differences. For this reason, the data has to be
integrated prior to the aggregation required for the data warehousing
analysis.

Using the Link Patterns proposed in this paper, we are now able to model 
the enterprise information grid as depicted in Figure 7. 



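[Figure 7: Data Link Model of the example. The only node label recoverable
from the figure reads ":SalesD {location = Düsseldorf}".]

Fig. 7. Example using Link Patterns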



The local applications, which maintain the local sales databases, are modeled
using the Data Processor. The sales data is replicated to the backup system
in New York, realized as a Gatekeeper. It thus controls the data flows from
the branches to the data warehouse and to the Düsseldorf branch. It must be
guaranteed that the data targets get only their designated data, i.e. neither
data from Bangalore nor data from Hong Kong is accessible to the European
head office in Düsseldorf. The data warehouse is realized by a node that
integrates several data sources using common integration strategies
(Integrator Pattern) and aggregates the data afterwards (Aggregator Pattern),
in order to provide OLAP applications with a homogeneous data stock.

Please keep in mind that the Data Link Model presented in Figure 7 reflects
the logical structure of the information platform, not the physical one. This
means that the nodes of the model do not have to be located on different
machines.
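
To make the data flows of this example more tangible, the following Python
sketch traces them from the backup system to the two data targets. It is a
rough illustration only: the rule table, the function names, the record
layouts ("total" and "sales"), and all sample figures are invented for this
purpose and do not stem from the model in Figure 7.

```python
from collections import defaultdict

# Assumed rule table of the New York Gatekeeper: target -> readable sources.
RULES = {
    "Duesseldorf": {"Paris"},
    "DataWarehouse": {"Duesseldorf", "Paris", "Bangalore", "HongKong"},
}


def gatekeeper(backup, target):
    """Pass on only the (source, record) pairs designated for the target."""
    allowed = RULES.get(target, set())
    return [(src, rec) for src, rec in backup if src in allowed]


def integrate(src, rec):
    """Integrator: map both assumed record layouts to one common schema."""
    amount = rec["total"] if "total" in rec else rec["sales"]
    return {"branch": src, "amount": amount}


def aggregate(rows):
    """Aggregator: sum the integrated amounts per branch."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["branch"]] += row["amount"]
    return dict(totals)


# Invented sample data: European branches use "total", the others "sales".
backup = [
    ("Paris", {"total": 10}),
    ("Duesseldorf", {"total": 20}),
    ("Bangalore", {"sales": 5}),
    ("HongKong", {"sales": 7}),
]

print(gatekeeper(backup, "Duesseldorf"))  # [('Paris', {'total': 10})]
warehouse = aggregate(integrate(s, r) for s, r in gatekeeper(backup, "DataWarehouse"))
print(sum(warehouse.values()))  # 42
```

As in the Data Link Model, the three functions describe logical roles; they
could run on one machine or on several.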

6 Related Work 

Data flow analysis and modeling has been a focus of researchers for decades.
Earlier work concentrates mainly on data flows in computer architectures and
software components (e.g. [15, 5]). Later on, data flows were also used for
query processing and optimization in database systems. For instance, Teeuw
and Blanken [14] compare control flow and data flow mechanisms for
controlling the execution of database queries on parallel database systems.
Dennis and Misunas present in [6] a Basic Data-Flow language, which
expresses graphically the data dependencies within a program. In this data
flow graph model, instructions are represented by nodes and paths stand for
data or control flows. Although this language was originally designed for
software development, it may be seen as an early forerunner in designing
data flows among different data sources. A specialized data flow graph is
introduced by Eich and Wells [7], which can be used for scheduling database
queries within multiprocessor environments or databases distributed over a
network [1]. Thus, both approaches apply data flow concepts to database
processing.

The Link Patterns are closely related to the Design Patterns of
object-oriented software design [8, 2] and of Enterprise Application
Integration (EAI) [10], since they represent prototypes or solutions for
recurring problems. In contrast to these patterns, Link Patterns are not
intended to solve recurring problems in software design or EAI, but to
provide modeling and description guidelines for information grids, focusing
exclusively on data flows.

As a possible application field of our Link Patterns we suggest modeling
or visualizing information grids, i.e. heterogeneous environments of data
sources sharing data, or modern information infrastructures based on P2P
concepts (e.g. [9] or [12]).

7 Conclusion and Future Work 

In this paper we have presented Link Patterns as guidelines for modeling and
describing data flows between nodes in information sharing environments. The
Link Pattern Catalog consists of prototypes or solutions for recurring
problems and thereby supports developers in modeling, describing, and
understanding complex information grids. Furthermore, the Link Patterns
provide a common vocabulary for design and communication purposes, enabling
developers to exchange successfully implemented solutions.

Additionally, we have introduced the Data Link Modeling Language (DLML)
for modeling, visualizing, and optimizing data flows, which is especially
suitable for information grids. This language, based on UML, consists of a
well-defined set of building blocks representing data nodes, application
nodes, and the data flows between them. They can be combined according to
specific rules to build up the Data Link Model of an information sharing
environment.

The concepts we have presented in this paper are well suited to generating a
static model of data and application nodes with their corresponding data
flows. In future work we have to consider dynamically changing and evolving
environments, in which nodes constantly join or leave the grid. This may
affect not only the Link Pattern Catalog, but also the Data Link Modeling
Language. Furthermore, the Catalog has to be enhanced in order to include
novel Link Patterns not yet identified. The entire Link Pattern Catalog is
intended to provide developers with an extensive reference guideline for
modeling information sharing environments.


Abstract. Design and maintenance of large corporate Web sites have
become a challenging problem due to the continuing increase in their size
and complexity. One particular feature present in the majority of such
Web sites is searching for information. However, the solutions provided
so far, which are based on the same techniques used for search in the open
Web, have not provided satisfactory performance for specific Web sites,
often resulting in too much irrelevant content in a query answer. This
paper proposes an approach to Web site modelling and generation of
intrasite search engines, combining application modelling and information
retrieval techniques. Our assumption is that giving search engines
access to the information provided by conceptual representations of the
Web site improves their performance and accuracy. We demonstrate our
proposal by describing a Web site modelling language that represents
both traditional modelling features and information retrieval aspects, as
well as presenting experiments to evaluate the resulting intrasite search
engine generated by our method.



1 Introduction 

The continuing increase in size and complexity of Web sites has turned their
design, construction and maintenance into a challenging problem. It often
involves access to databases, complex cross-referencing between pieces of
information and sophisticated user interaction. This is particularly true
for data-intensive Web sites, which are subject to frequent content updates.

In the same way, finding the desired information among the pages of large and
complex Web sites is an important problem. This is why one of the most
popular and useful features in any large data-intensive Web site is the
ability to search its content by means of a search engine. We refer to this
sort of information retrieval (IR) system as an intrasite search engine.
Such systems represent an alternative to navigation-based access and are
currently present in most corporate Web sites, with many commercial products
available. However, the effectiveness of these systems is questionable. In
fact, recent studies show that current intrasite search engines usually fail
to satisfy user queries, providing too many irrelevant answers in the
result [17,33].

The major problem faced by intrasite search engines is that there is little
information available to reason about the relevance of a page to a user
query. Intrasite search engines are usually developed using the same
techniques applied by conventional global search engines, such as Google or
Altavista. However, some important sources of evidence that determine the
relevance of a page in a global search engine are usually not available in
intrasite search. For instance, relevance metrics based on link analysis
such as Pagerank [2] or HITS [24] make no contribution to improving the
quality of retrieval over small portions of the Web [18].

On the other hand, most corporate Web sites and Intranets are developed
based on data models that define conceptual, logical and physical (i.e.,
navigational and presentational) features of the Web site prior to the
generation of the pages. These data models are rich sources of information
that we believe can be used as evidence to determine the relevance of Web
pages given a user query. Although there are several research works on Web
site modelling and automated generation [28,26,11,6,15], to our knowledge
none of these methods provides means for using modelling-related information
to support intrasite search engines. Similarly, no current intrasite search
system uses information retrieval models that are able to make use of the
Web site modelling to provide better results.

In Web site design, a principle accepted by many authors is separation be- 
tween information content, navigation structure and visualization [12,28]. This 
idea promotes a better understanding of the data requirements (content), the 
underlying architecture of the site (navigation) and an appropriate user inter- 
face (visualization). Furthermore it makes maintenance tasks easier as each of 
those components can be managed separately [5,30]. Recent technologies such 
as XML, XSL and style sheets also promote separation between content and 
visualization, encouraging and facilitating the development of methods for Web 
site construction based on those concepts. 

Our proposal for Web site development is based on these ideas but innovates
by modelling IR aspects of the application. Our assumption is that, by
modelling specific information retrieval attributes of the information
content of a Web site, it is possible to develop search engines that achieve
a significant improvement in overall ranking quality. In the experiments
presented here, our approach yielded a 48% improvement in average precision
compared with traditional implementations of intrasite search engines. Our
proposal merges an IR-aware methodology with model-aware intrasite search
engine development.

Throughout the paper, we use the terms Web site and Web site application
interchangeably because our method can be applied to both of them. However,
due to space limitations, forms and operations (dynamic pages), which are
the main characteristic of Web site applications, are not described in
detail here. We focus on modelling information retrieval aspects and
generation of intrasite search engines.

The following section discusses some related work either on Web site
application development or on information retrieval for individual Web
sites. Section 3 presents our intermediate representation language and
Section 4 discusses the details of generating search engines using the
proposed approach. Section 5 presents experiments carried out based on an
example Web site and search engine. Finally, Section 6 presents some final
remarks.



2 Related Work 

In this section we discuss some existing methods for development and mainte- 
nance of Web site applications as well as information retrieval work addressing 
intrasite or Intranet searching. 

2.1 Web Site Development Methods 

There is general agreement amongst most recent work on some core concepts
regarding Web site development, such as separation between information
content, navigation structure and visualization; declarative specifications,
by means of high-level conceptual data models or declarative languages; and
automated or semi-automated generation of Web sites by means of CASE tools.

The main differences between the proposals are in the emphasis given to
particular aspects of the development process. Most work focuses on modelling
aspects, as in Araneus [26], Strudel [11], OOHDM [28], OO-H [15] and
WebML [6]. There is also work based on semantic descriptions, such as
OntoWebber [21] and SEAL [25]. Our approach is a data-driven approach, which
focuses on the generation of different visualizations and on an associated
search engine.

Where the work is driven by modelling, most approaches are based on
traditional conceptual data models, such as the entity-relationship model
(ER) [8] and its extensions, or on object-oriented data models. In this
category we can include Araneus, WebML, OO-H and OOHDM. WebML proposes a
structural model compatible with the ER model, the ODMG object-oriented data
model and UML class diagrams. The OO-H method is based on an object-oriented
model. OOHDM is also an object-oriented extension to HDM [13], a method for
modelling hypermedia applications. Strudel models a Web site as graphs.
OntoWebber and SEAL are based on DAML+OIL and RDF, respectively, which are
used to define ontologies describing the application domain. Our approach
can support different conceptual data models. We advocate the idea of using
existing data models, provided that appropriate mapping procedures to our
intermediate representation are defined.

Although existing methods such as those discussed above are appropriate for
modelling and creating Web sites, to our knowledge the issue of information
retrieval in Web sites is not addressed by any Web site development method.
Our work differs by offering a framework for Web site modelling and
construction that includes information retrieval aspects. This feature makes
it possible to automatically generate a suitable search engine for the Web
site constructed.

2.2 Information Retrieval Approaches to Intrasite Searching

The problem of searching in small portions of the Web has been directly or
indirectly addressed in several recent works [20, 10]. In general, there is
agreement that current techniques applied in traditional global Web search,
such as anchor text information [10,9], URL level [31] and link analysis
[23, 14], contribute little to ranking quality when applied to small
portions of the Web, which is the case of intrasite search. The reasons for
this are manifold. For instance, anchor text and link analysis rely on
citations made from pages to other pages. However, this is only effective
due to the inherent diversity of the content generated collectively across
the global Web. Additionally, as pages of data-intensive sites are in most
cases generated dynamically, their URLs exhibit no level information. Thus,
the URL level technique is not applicable.

Currently, several products have also been made available for intrasite
search. Some examples of such products are the Cha-Cha intrasite search
engine [7], the Google Search Appliance (footnote 1) and the AltaVista
Enterprise search engine (footnote 2), to name a few. However, based on
their available documentation, these products do not consider any underlying
Web site modelling as a source of evidence for computing their search
results. In fact, no considerations are made regarding possible specific
requirements of intrasite search. Google Search Appliance, for
instance, is said to “use the same technology that powers Google search engine
into intranets and corporate Web sites”. This may explain why the quality of
results provided by most of the current intrasite search systems is close to
the results obtained by the vector space model [27] (which is commonly used
as the baseline model for information retrieval systems), while global
search engines usually outperform this model by far.

A more recent work presented by Xue et al. [33] has proposed an approach
for improving the quality of results in intrasite search systems by mining
users’ access patterns. Their approach improved the precision by 16% for the
top matches by constructing artificial links between the Web pages of the
site and applying a variation of Pagerank to compute the importance of each
page in the Web site. This idea is completely orthogonal to what we present
here, and the two techniques can even be used in a complementary way to
improve the quality of the results provided by intrasite search systems.

3 Modelling Information Retrieval Aware Web Sites 

Systematic approaches can bring many benefits to Web site construction, mak- 
ing development more methodical and maintenance less time consuming. One 
way to tackle this problem is by providing a high-level description of a Web 
site application independent from any particular implementation. This allows 
the designer to concentrate on the application description rather than on the 
mechanics for producing the Web site. 

1 http://www.google.com/appliance/

2 http://www.searchtools.com/tools/altavista.html

The proposed approach begins with a high level description of an application.
For this task several existing data models such as the entity-relationship
model [8] and its extensions or UML class diagrams can be used. From the
application description an intermediate representation is
(semi-)automatically derived. Depending on the data model used, different
mapping procedures must be defined in order to generate the intermediate
representation. In [4] mapping rules for transforming ER diagrams into a
logic-based Web site representation are proposed.

In this paper we focus on the intermediate representation and the further
steps to create Web applications. We assume that modelling and an appropriate
mapping procedure from the data model concepts to our intermediate
representation language have already been performed. Conceptual modelling
using known data models and mapping procedures to an intermediate
representation is a common approach used by most techniques described in
Section 2.

An intermediate representation is useful for several reasons. It provides a 
declarative specification of the application, still independent from the imple- 
mentation language but closer to Web site constructs, such as pages and links, 
than the original data model [3]. It also provides flexibility for generating dif- 
ferent views of a Web site, either in different target languages (HTML, XML, 
WML, etc) or different visualizations, for example by grouping pieces of infor- 
mation using different criteria or using different styles. Automated generation of 
Web sites is also easier with an intermediate representation which is combined 
with a visualization description in order to generate a corresponding Web site. 
Figure 1 illustrates this idea. 





Fig. 1. General Architecture of Web site Generator



As discussed earlier, our approach also follows the idea of separation between
content, navigation and visualization. Information content refers to data to be
displayed on the Web pages. Navigation structure defines the organization of the
site and how items of information are related to each other. Finally, visualization
concerns how the information will be presented on the Web pages comprising
the site. Therefore, any notation for specifying such applications should provide
mechanisms for representing those concepts. Another design principle is to define
a Web site application in a declarative way. This facilitates automated generation
of Web site applications.

In addition, we also aim at describing content relevance and other
information retrieval aspects in order to produce a search engine associated
with the Web site constructed. As a result the search engine will be able to
use some “knowledge” about the Web site modelling, which gives more accurate
results than other search engines within that particular Web site.
Experiments with an example Web site and an associated search engine are
discussed in Section 5. In the remainder of this section we give the details
of the proposed modelling language used to specify a Web site application
according to our approach.

3.1 An IR-Aware Web Site Modelling Language 

In our view, a Web site is a collection of pieces of information which are organised 
in display units related by transitions. These are the basic components of our 
modelling language which are detailed below. 

Pieces of information are structured “chunks” of data having an identifier and
a type. Their actual content values (instances) are often derived from a
database or other data source. The appropriate access mechanisms to retrieve
values for a piece of information must also be defined. We have defined types
for pieces of information. Types are useful for choosing an appropriate
visualization style for each piece of information, such as lists and tables.
Note that we do not propose a data type system as traditionally defined in
programming languages. The types defined are simple, list, tuple, table, form
and searchInfo.

The general format of a piece of information is info(Id, Type, Datum), where
Id is a unique identifier for the piece of information, Type is one of the
pre-defined types and Datum denotes an actual value of the piece of information.
Note that a piece of information might have several instances depending on the
instances in the data source. Figure 2 depicts the specification of a simple
type piece of information using XML Schema.

Note that “Access” refers to the access method to the data source and
“InfoStyle” refers to the visualization style for that piece of information.
Other types of pieces of information are defined in a similar fashion. A list
denotes a piece of information composed of more than one data item of the same
simple type. A tuple refers to a list of related data items of different types
and a table is a list of tuples. Form refers to elements for data entry. This
choice facilitates the generation of pages with forms, since specific elements
for constructing forms are usually offered by languages such as HTML and WML.
Similarly, we have defined the type searchInfo, which refers to the input
parameters to the intrasite search engine. A typical implementation is a text
box associated with a submit button that triggers the execution of the query.
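As an illustration only, the info(Id, Type, Datum) format can be sketched as a small Python record. The class name `Info`, the constant `INFO_TYPES` and the sample values are assumptions made for this sketch; the paper itself specifies pieces of information in XML Schema, not in Python.

```python
from dataclasses import dataclass
from typing import Any

# The six pre-defined types named in the text.
INFO_TYPES = {"simple", "list", "tuple", "table", "form", "searchInfo"}


@dataclass(frozen=True)
class Info:
    """One piece of information, mirroring info(Id, Type, Datum)."""
    id: str      # unique identifier
    type: str    # one of INFO_TYPES
    datum: Any   # one instance value, e.g. fetched from a database

    def __post_init__(self):
        # Reject types outside the pre-defined set.
        if self.type not in INFO_TYPES:
            raise ValueError(f"unknown piece-of-information type: {self.type}")


# Hypothetical instances: a simple value and a list of simple values.
title = Info("NewsTitle", "simple", "Markets close higher")
recent = Info("RecentNews", "list", ["Headline A", "Headline B"])
```

The type is kept as a plain label rather than a Python type because, as stated above, it only selects a visualization style.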

Display units are containers of pieces of information that will be presented 
to users in a visualization style defined by the designer. Display units can also 
be seen as classes of pages, where each page corresponds to an instance of a 
display unit. Therefore, a display unit might result in several pages of the same 
<xs:element name="SimpleInfo" id="SimpleInfo">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Datum" id="Datum">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute ref="Comment"/>
              <xs:attribute ref="Type"/>
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
      <xs:element ref="Detail" minOccurs="0"/>
    </xs:sequence>
    <xs:attribute ref="Id"/>
    <xs:attribute ref="Access"/>
    <xs:attribute ref="InfoStyle"/>
  </xs:complexType>
</xs:element>

Fig. 2. XML Schema for Simple Pieces of Information 



type, having the same pieces of information and visualization style, each page
presenting different instances of pieces of information. Formally, a display
unit is defined as Display(Id, Info_1, Info_2, ..., Info_n), where Id is a unique
identifier and each Info_i is the identifier of a piece of information.
Important attributes of display units related to information retrieval are
IRDisplayContentType and IRSignificance, which will be detailed in Section 3.2.
A definition of display unit is presented in Figure 3.

Similarly to display units, transitions define classes of links. A transition
is defined by an origin and a target. The origin of a transition can be a piece
of information or a display unit. The former defines links that have a data item
as the link anchor, whereas the latter refers to simple navigation links,
without related pieces of information. Given that both sorts of links and
operations result in a display unit, the target is always a display unit.
Operations are also considered a type of transition, since an operation
represents a link from one page to another via some computation. There are
several issues related to the representation and automated generation of
operations which are left out here due to space restrictions. The definition of
transitions is presented in Figure 4.
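The two rules stated above (the origin is a piece of information or a display unit, and the target is always a display unit) can be checked mechanically. The following is a minimal sketch under assumed names: `Display`, `Transition` and `check_transitions` are hypothetical and are not part of the paper's XML-based notation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Display:
    """Display(Id, Info_1, ..., Info_n): a class of pages."""
    id: str
    info_ids: Tuple[str, ...]


@dataclass(frozen=True)
class Transition:
    """A class of links from an origin to a target display unit."""
    origin: str  # id of a piece of information or of a display unit
    target: str  # id of a display unit


def check_transitions(displays, info_ids, transitions):
    """Return the transitions that break the origin/target rules."""
    display_ids = {d.id for d in displays}
    bad = []
    for t in transitions:
        origin_ok = t.origin in display_ids or t.origin in info_ids
        target_ok = t.target in display_ids
        if not (origin_ok and target_ok):
            bad.append(t)
    return bad


home = Display("HomePage", ("NewsOfTheHour", "RecentNews"))
news = Display("NewsPage", ("NewsTitle",))
valid = Transition("NewsOfTheHour", "NewsPage")
invalid = Transition("NewsOfTheHour", "NewsTitle")  # target is not a display unit
```

Here `check_transitions([home, news], {"NewsOfTheHour", "RecentNews", "NewsTitle"}, [valid, invalid])` reports only the second transition.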



3.2 Modelling Information Retrieval Aspects 

Our Web site modelling language supports the representation of information
retrieval aspects by defining special attributes of pieces of information and
display units. These attributes are related to measuring the importance of
information.
<xs:element name="Display" id="Display">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Info" id="Info" maxOccurs="unbounded">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute ref="Id"/>
              <xs:attribute ref="IRSignificance"/>
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute ref="Id"/>
    <xs:attribute ref="Access"/>
    <xs:attribute ref="IRDisplayContentType"/>
  </xs:complexType>
</xs:element>



Fig. 3. XML Schema for Display Units 



There are two basic attributes used so far: IRDisplayContentType and
IRSignificance, as defined in Figure 5.

IRDisplayContentType allows the categorisation of display units with respect
to their pieces of information. Possible values for IRDisplayContentType are:

— Entry - A display unit is an entry point to the Web site when it includes 
a number of links to “sub-pages” and as a result pages of this type do not 
focus on any particular information content. 

— Content - This sort of display unit refers to those pages that include specific 
content often related to a particular subject. Each “sub-page” linked from 
the homepage or from an “Entry” page is usually a “Content” page. 

— Irrelevant - This denotes those pages that should be discarded by the search 
engine, since they do not provide any relevant content. 

As examples, the homepage of a university is an entry display unit, a page
with a list of courses offered by an academic department is a content display
unit, and an error page such as “invalid login” is defined as irrelevant.

IRSignificance supports the specification of the degree of importance of a
piece of information with respect to the display unit in which it is presented.
This is very important to search engines since it provides a measurable means
to evaluate how a piece of information is related to the subject of its page.
It also means that the same piece of information can have a different degree of
importance if presented in another display unit. This is a key concept for
developing search engines better suited to intrasite search than global search
engine techniques. IRSignificance allows the search engine to make a distinction
between pieces of information by their actual degree of importance as modelled
by the Web
<xs:element name="Transition" id="Transition">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Origin" id="Origin">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute ref="Id"/>
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
      <xs:element name="DisplayUnit" id="DisplayUnit">
        <xs:complexType>
          <xs:simpleContent>
            <xs:extension base="xs:string">
              <xs:attribute ref="Id"/>
            </xs:extension>
          </xs:simpleContent>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:complexType>
</xs:element>



Fig. 4. XML Schema for Transitions 



site designer instead of “guessing” it by their position in the page (text, header, 
title, footnote, etc). We have defined the following degrees of importance to 
pieces of information in a display unit: 

— High - assigned to pieces of information of higher priority to the process of 
information retrieval. Usually refers to content directly related to the main 
subject of the page where it is presented. 

— Medium - assigned to pieces of information that are not directly related to 
the main subject of the Web page. For example the name of a lecturer is 
“high” if it is placed in his/her personal homepage. However it can be defined 
as “medium” if the lecturer name is presented in a course page. 

— Low - assigned to pieces of information which have low priority to search
engines.

— Irrelevant - assigned to pieces of information that can be discarded by the 
search engine. 

An example of Web site modelling including the intrasite search engine
design is presented in Section 5. In the next section we show how an intrasite
search engine can make use of the modelling method discussed here.
<xs:attribute name="IRDisplayContentType" default="content" id="RIDisplayContentType">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:enumeration value="entry"/>
      <xs:enumeration value="content"/>
      <xs:enumeration value="irrelevant"/>
    </xs:restriction>
  </xs:simpleType>
</xs:attribute>
<xs:attribute name="IRSignificance" default="irrelevant" id="Significance">
  <xs:simpleType>
    <xs:restriction base="xs:string">
      <xs:enumeration value="high"/>
      <xs:enumeration value="medium"/>
      <xs:enumeration value="low"/>
      <xs:enumeration value="irrelevant"/>
    </xs:restriction>
  </xs:simpleType>
</xs:attribute>



Fig. 5. XML Schema for IR Aspects 



4 Modelling-Aware Intrasite Search Engines

The creation of the intrasite search engine is accomplished by taking into
account specific annotations present in the generated Web pages. These
annotations derive from the specifications of the attributes
IRDisplayContentType and IRSignificance defined in the intermediate
representation. This gives immediate access to the information needed by the
search engine to appropriately index the Web site content, and avoids the need
to access the intermediate representation to gather IR-related specifications.
All of this information is annotated in the resulting Web pages written in a
target language.

In our experiments, we have used HTML as the target language. The IR 
annotations are found in each page as HTML comments (<!-- and -->). In order
to define the scope of each IR annotation, we have defined both opening and 
closing tags for IRSignificance. This is necessary because significance of a piece 
of information is relative to the page where it is presented. The same piece of 
information might have a different degree of relevance if presented in a different 
page. Since IRDisplayContentType is unique for each page, its annotation is 
placed before the tag <html> and its closing tag is placed after the tag </html>. 
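The paper does not show the concrete text of these comments, so the following sketch assumes a hypothetical syntax (`<!-- IRSignificance="high" -->` paired with `<!-- /IRSignificance -->`, and likewise for IRDisplayContentType) purely to illustrate how an indexer might recover the page type and the scoped pieces of information. The function name `read_annotations` is also an assumption, and a regular expression stands in for a real HTML parser.

```python
import re

# Hypothetical annotation syntax; the paper only says HTML comments are used.
PAGE_TYPE = re.compile(r'<!--\s*IRDisplayContentType="(\w+)"\s*-->')
PIECE = re.compile(
    r'<!--\s*IRSignificance="(\w+)"\s*-->(.*?)<!--\s*/IRSignificance\s*-->',
    re.DOTALL,
)
TAG = re.compile(r"<[^>]+>")


def read_annotations(html):
    """Return (page_type, [(significance, text), ...]) for one page."""
    match = PAGE_TYPE.search(html)
    page_type = match.group(1) if match else None
    pieces = []
    for level, body in PIECE.findall(html):
        # Drop markup inside the annotated span and normalise whitespace.
        text = " ".join(TAG.sub(" ", body).split())
        pieces.append((level, text))
    return page_type, pieces


page = (
    '<!-- IRDisplayContentType="content" -->\n'
    "<html><body>\n"
    '<!-- IRSignificance="high" --><h1>Stock market rallies</h1>'
    "<!-- /IRSignificance -->\n"
    '<!-- IRSignificance="irrelevant" --><a href="/x">Other news</a>'
    "<!-- /IRSignificance -->\n"
    "</body></html>\n"
    "<!-- /IRDisplayContentType -->\n"
)
```

With this sample, `read_annotations(page)` yields the page type `content` and two annotated pieces, one marked `high` and one marked `irrelevant`.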

Based on the IR annotations, the intrasite search engine can improve the
quality and accuracy of query results. Only pages which have
IRDisplayContentType equal to “content” are indexed. Pages defined as “entry”
are those which are entry points to the Web site. The distinction between
content and entry pages is necessary because user queries can also be classified
in two categories [22,29]: (1) bookmark queries, which aim at locating an entry
page for a specific portion of a site, for example the entry page of the economy
section of a newspaper Web site; and (2) content queries, the most common sort,
which are user queries that result in single content pages, for example finding
a page that describes how the stock market operates.

Pages with IRDisplayContentType “irrelevant” are automatically discarded.
Annotating irrelevant pages is important because it lets the system index only
pages with relevant content, making the resulting search engine more efficient
and accurate.

Each piece of information has a level of significance defined by the value
given to the attribute IRSignificance. This attribute specifies the importance
of a particular piece of information with respect to the page where it is
placed. This means that the same piece of information might have a different
level of significance when placed in another page. In the indexing process the
system stores each piece of information, its location and its IRSignificance
value. In addition, the number of occurrences of each term present in the piece
of information is also stored.

This information is used by our information retrieval model to compute the
ranking of documents for each user query submitted to the intrasite search
engine. The information retrieval model adopted here is an extension of the
well-known Vector Space Model [27]. This model is based on identifying how
related each term (word) t is to each document (page) d, which is expressed as
a function w(d, t). Queries are modelled in the same way, and the function w is
used to represent each element as a vector in a space determined by the set of
all distinct terms. The ranking in this model is computed by the score function
for each document d in the collection and a given query q, as in the equation
below.



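$$
Sim(d, q) = \frac{\sum_{i=1}^{n} w(d, t_i) \cdot w(q, t_i)}
                 {\sqrt{\sum_{i=1}^{n} w(d, t_i)^2} \cdot \sqrt{\sum_{i=1}^{n} w(q, t_i)^2}}
\qquad (1)
$$

which is the cosine of the angle between the vectors d and q, taken over the n
distinct terms t_1, ..., t_n, and expresses how similar document d is to the
query q. The documents which have similarity Sim(d, q) higher than zero are
presented to the users in descending order of similarity.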
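Equation 1 can be illustrated with a short sketch that stores each vector as a dictionary from term to weight. The function names `sim` and `rank` and the sparse-dictionary representation are assumptions of this sketch, not details given in the paper.

```python
import math


def sim(doc_weights, query_weights):
    """Cosine between two sparse term-weight vectors, as in Equation 1."""
    dot = sum(w * query_weights.get(t, 0.0) for t, w in doc_weights.items())
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_d * norm_q)


def rank(docs, query_weights):
    """Pages with similarity above zero, in descending order of similarity."""
    scored = [(sim(weights, query_weights), name) for name, weights in docs.items()]
    return sorted(((s, n) for s, n in scored if s > 0.0), reverse=True)
```

For example, a page with weights `{"a": 3.0, "b": 4.0}` and a query `{"a": 1.0}` give 3 / (5 x 1) = 0.6, and a page sharing no term with the query is dropped from the ranking.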

The function w(d, t) in Equation 1 gives a measurement of how related term
t and document d are. This value is usually computed as tf(d, t) × idf(t), where
idf(t) is the inverse document frequency and measures the importance of term t
for the whole set of documents, while tf(d, t) expresses the importance of term
t for document d.

The idf value is usually computed as 



$$
idf(t) = \log\left(\frac{\#docs}{f(t)}\right) \qquad (2)
$$

where #docs is the number of documents (pages) in the collection and f(t) is
the number of documents where term t occurs.
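A minimal sketch of Equation 2 follows. The paper does not state the base of the logarithm or how terms absent from the collection are handled, so the natural logarithm and a value of 0 for unseen terms are assumptions here, as is the representation of each page as a set of terms.

```python
import math


def idf(term, docs):
    """Inverse document frequency of `term` over `docs` (a list of term sets)."""
    containing = sum(1 for d in docs if term in d)  # f(t)
    if containing == 0:
        return 0.0  # assumption: unseen terms contribute nothing
    return math.log(len(docs) / containing)  # assumption: natural logarithm


collection = [{"stock", "market"}, {"stock"}, {"weather"}, {"stock", "weather"}]
```

A term occurring in 3 of the 4 pages gets log(4/3), while a term occurring in a single page gets the larger value log(4), reflecting that rarer terms discriminate better.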

The tf value can be computed in several ways. However, it is always a function
of the term frequency in the document. Common formulae directly use the number
of occurrences of t in d [16]. We propose here to use information provided by
the Web site modelling to define the function tf based not only on the term
frequency, but also on the IRSignificance specified during the Web site
modelling. Given a Web page d composed of s different pieces of information
{d_1, ..., d_s}, we define

$$
tf(d, t) = \sum_{i=1}^{s} freq(d_i, t) \times IRSignificance(d_i) \qquad (3)
$$

where freq(d_i, t) gives the number of occurrences of term t in the piece of
information d_i and IRSignificance(d_i) takes the value 0, 1, 2 or 3,
corresponding to irrelevant, low, medium or high, respectively, for the piece of
information d_i as derived from the Web site modelling.

By using this equation, the system assigns to each piece of information a
precise importance value, so that pages are ranked higher when the terms of a
query match their most significant pieces of information.
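Equation 3 can be worked through on a small example. The sketch below assumes a page is given as a list of (significance label, list of terms) pairs; the names `SIGNIFICANCE`, `tf` and `news_page` and the sample terms are illustrative assumptions, while the 0 to 3 weights are those stated above.

```python
# Weights stated in the text: irrelevant=0, low=1, medium=2, high=3.
SIGNIFICANCE = {"irrelevant": 0, "low": 1, "medium": 2, "high": 3}


def tf(pieces, term):
    """Significance-weighted term frequency of `term` in one page (Equation 3)."""
    return sum(terms.count(term) * SIGNIFICANCE[label] for label, terms in pieces)


# Hypothetical news page: title (high), summary (medium), body (low), links (irrelevant).
news_page = [
    ("high", ["stock", "market", "rallies"]),
    ("medium", ["the", "stock", "index", "rose"]),
    ("low", ["analysts", "said", "the", "market", "and", "the", "stock", "index", "rose"]),
    ("irrelevant", ["stock", "tips"]),
]
```

For the term "stock" this gives 1x3 + 1x2 + 1x1 + 1x0 = 6, whereas a plain occurrence count would give 4; a term that appears only in an irrelevant piece, such as "tips", contributes 0 and is effectively invisible to the ranking.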

4.1 Generating a Web Site and Its Associated Intrasite Search Engine

A high-level specification of an application must be provided as the starting 
point of the development process. Existing conceptual models can be used for 
this task as discussed earlier. Since issues related to mapping data model
constructs to our intermediate representation language are not in the scope of
this paper, we assume that this task has already been carried out. Details of
mapping procedures from an ER schema to our intermediate representation can 
be found in [4]. 

Once an intermediate representation of the Web site application is provided, 
the next step is the generation of the pages and the search engine. The steps to 
perform a complete generation involve: 

1. Instantiation of pieces of information, which usually involves access to
databases.

2. Creation of pages. For each display unit a number of corresponding pages are 
created depending on the number of instances of its pieces of information. 

3. Instantiation of links. 

4. Translation of all intermediate representation to a target language, such as 
HTML. 

5. Application of visualization styles to all pieces of information and pages, 
based on style and page templates definitions. 

6. Creation of additional style sheets, as CSS specifications. 

7. Creation of the intrasite search engine. 

Visualization is described by styles for individual pieces of information,
page styles (stylesheet) and page templates. A suitable interface should be
offered to the designer in order to input all necessary definitions. Currently,
a standard CSS stylesheet is automatically generated including definitions
provided by the designer. The reason for using stylesheets is to keep the
representation of our visualization styles simple. Without a stylesheet, all
visual details would have to be included as arguments to the mapping procedure
which translates a visualization style to HTML.
As for the creation of the intrasite search engine, its code is automatically
incorporated as part of the resulting Web site. Furthermore, its index is
generated along with the Web site pages, according to the IRDisplayContentType
and IRSignificance specifications.

5 Experiments 

In this section we present experiments to evaluate the impact of our new inte- 
grated strategy for designing Web sites and intrasite search engines. For these 
experiments we have constructed two intrasite search systems for the Brazilian
Web portal “ultimosegundo”, indexing 12,903 news Web pages.

The first system was constructed without considering information provided 
by the application data model and it has been implemented using the traditional 
vector space model [27]. 

The second system was constructed using our IR-aware data model described
in Section 4. To construct the second system we first modelled the Web site
using our IR-aware methodology, generating a new version where the
IRDisplayContentType of each page and the IRSignificance of the semantic pieces
of information that compose each page are available. Figure 6 illustrates a
small portion of the intermediate representation of the Web site modelled using
our modelling language. The structure and content of this new site are the same
as in the original version: all pages are preserved with the same content.

The first side effect of our methodology is that only pages with useful content 
are indexed. In the example only pages derived from the display unit NewsPage 



<Info>
  <SimpleInfo Id="NewsOfTheHour" Access="BDQuery"> ... </SimpleInfo>
  <SimpleInfo Id="NewsTitle" Access="BDQuery"> ... </SimpleInfo>
  <ListInfo Id="RecentNews" Access="BDQuery" Type="string"> ... </ListInfo>
</Info>
<Display Id="HomePage" IRDisplayContentType="entry">
  <InfoId Id="NewsOfTheHour" IRSignificance="high"/>
  <InfoId Id="RecentNews" IRSignificance="medium"/>
  <InfoId Id="OtherNews" IRSignificance="irrelevant"/>
</Display>
<Display Id="NewsPage" IRDisplayContentType="content">
  <InfoId Id="NewsTitle" IRSignificance="high"/>
  <InfoId Id="NewsSummary" IRSignificance="medium"/>
  <InfoId Id="NewsText" IRSignificance="low"/>
  <InfoId Id="OtherNews" IRSignificance="irrelevant"/>
</Display>
<Transition>
  <Origin Id="NewsOfTheHour"/>
  <DisplayUnit Id="NewsPage"/>
</Transition>



Fig. 6. Example of a Partial Intermediate Representation of a Web Site 
are indexed. Furthermore, pieces of information that do not represent useful
information are also excluded from the search system. For instance, each news
Web page in the site has links to related news (OtherNews); these links are
considered non-relevant pieces of information because they are included in
the page as a navigation facility, not as content. As a result, the final index
size was only 43% of the index file created to index the original site, which
means our intrasite search version uses less storage space and is faster when
processing user queries.

The experiments evaluating the quality of results were performed using a
set of 50 queries extracted from a log of queries on news Web sites. The queries
were randomly selected from the log and have an average length of 1.5 terms, as
the majority of queries are composed of one or two terms. In order to evaluate
the results, we have used a precision-recall curve, which is the most widely
applied method for evaluating information retrieval systems [1]. The precision
at any point of this curve is computed using the set of relevant answers for
each query (N) and the set of answers given by each system for this query (R).
The formulae for computing precision and recall are given in Equation 4. For
further details about precision-recall curves the interested reader is referred
to [1,32].



$$
Precision = \frac{\#(R \cap N)}{\#R}, \qquad
Recall = \frac{\#(R \cap N)}{\#N} \qquad (4)
$$
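Equation 4 translates directly into set operations. In this sketch `returned` plays the role of R and `relevant` the role of N; the function name and the convention of returning 0 for an empty set are assumptions.

```python
def precision_recall(returned, relevant):
    """Precision and recall of one answer set, following Equation 4."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # #(R ∩ N)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For instance, if a system returns pages {1, 2, 3, 4} and the relevant pages are {2, 4, 6}, two of the four answers are relevant (precision 0.5) and two of the three relevant pages were found (recall 2/3).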



To obtain the precision-recall curve we need to use human judgment for
determining the set of relevant answers for each query evaluated. This set was
determined here using the pooling method used for the Web-based collections of
TREC [19]. This method consists of retrieving a fixed number of top answers from
each of the system options evaluated and then forming a pool of answers, which
is used for determining the set of relevant documents. Each answer in the pool
is analyzed by humans and is classified as relevant or non-relevant for the
given user query. After analyzing the answers in the pool, we use the relevant
answers identified by humans as the set N in Equation 4.

For each of the 50 queries of our experiments, we composed a query pool
formed by the top 50 documents generated by each of the 2 intrasite search
systems evaluated. The query pools contained an average of 62.2 pages (some
queries had fewer than 50 documents in the answer). All documents in each query
pool were submitted to a manual evaluation. The average number of relevant
pages per query pool is 28.5.
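The pooling step can be sketched as a set union over the top answers of each system. The function name `build_pool` and the sample rankings are hypothetical; only the depth of 50 comes from the text.

```python
def build_pool(rankings, depth=50):
    """Union of the top `depth` answers of every system for one query."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:depth])  # shorter rankings contribute all they have
    return pool


# Toy rankings from two systems for the same query.
conventional = ["a", "b", "c"]
modelling_aware = ["b", "d"]
```

Because the pool is a union, pages returned by both systems are judged only once, which is why a pool built from two top-50 lists can hold far fewer than 100 pages.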

Figure 7 shows the precision-recall curves obtained in our experiment for
both systems. Our modelling-aware intrasite search is labelled in the figure as
“Modelling-aware”, while the original vector space model is labelled as
“Conventional”. The figure shows that the quality of the ranking results of our
system was superior at all recall points. The precision at the first points in
the curve was roughly 96% for our system against 86.5% for the conventional one,
an improvement of almost 11% in precision. For higher levels of recall the
difference becomes even higher, being roughly 20% at 50% of recall and 50% at
100%. This last result indicates that our system found on average 50% more
relevant documents in this experiment. The average precision for the 11 points
was 56% for the
Fig. 7. Comparison of average precision versus recall curves obtained when processing
the 50 queries using the original vector space model and the IR-aware model 



conventional system and 84% for the Modelling-aware system, which represents 
an improvement of 48%. 
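The paper does not spell out how the 11 points are obtained. A common convention, assumed here, is interpolated precision at the recall levels 0%, 10%, ..., 100%, where the precision at a level is the highest precision observed at any recall greater than or equal to that level. The function names below are hypothetical.

```python
def eleven_point_precision(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    relevant = set(relevant)
    if not relevant:
        return [0.0] * 11
    points = []  # (recall, precision) measured after each relevant answer
    hits = 0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / position))
    levels = [i / 10 for i in range(11)]
    # Interpolation: best precision at any recall >= the level, else 0.
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]


def average_11pt(ranking, relevant):
    """Mean of the 11 interpolated precision values."""
    values = eleven_point_precision(ranking, relevant)
    return sum(values) / len(values)
```

For the ranking a, x, b, y with relevant pages a and b, the precision is 1.0 up to 50% recall and 2/3 beyond it. Averaging such per-query curves over all queries gives one curve per system, which is the kind of comparison shown in Figure 7.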

Another important observation is that our system returned on average only
209.8 documents per query (of which we selected 50 for evaluation), while the
original system returned 957.66 results on average. This difference is again
due to the elimination of non-relevant information from the index. To give an
example, the original system returned almost all pages for the query
“September 11th”, while our system returned fewer than 300 documents. This
difference arose because almost all pages in the site had a footnote text
linking to a special section about this topic.

6 Conclusions 

We have presented a new modelling technique for Web site design that transfers 
information about the model to the Web pages generated. We also presented 
a new intrasite search model that uses this information to improve the quality 
of results presented to users and to reduce the size of the indexes generated for 
processing queries. In our experiments we have presented one particular example 
of application of our method that illustrates its viability and effectiveness. The 
gains obtained in precision and storage space reduction may vary for different 
Web sites. However, this example gives a good indication that our method
can be effectively deployed to solve the problem of intrasite and Intranet search.
For the site modelled we had an improvement of 48% in the average precision 
and at the same time a reduction in the index size, occupying only 43% of the 
space used by the traditional implementation. That means our method produces 
faster and more precise intrasite search systems. 

As future work we are planning to study the application of our method to 
other Web sites in order to evaluate in more detail the gains obtained and to 
refine our approach. We are also studying strategies for automatically computing
the IRSignificance of pieces of information and for automatically determining
the weights of each piece of information for each display unit. These automatic
methods will allow the use of our approach for non-modelled Web sites, which
may extend the benefits of our method to global search engines.

The paradigm described here opens new possibilities for designing better
intrasite search systems. Another future research direction is defining new
modelling characteristics that can be useful for intrasite search systems. For
instance, we are interested in finding ways of determining the semantic
relations between Web pages during the modelling phase and of using this
information to cluster these pages in a search system. The idea is to use the
cluster properties to improve the knowledge about the semantic meaning of each
Web page in the site.

Abstract. The World Wide Web has gained a lot of prominence with respect to
information retrieval and data delivery. With such prolific growth, a user
interested in a specific change has to continuously retrieve/pull information
from the web and analyze it. This wastes resources and, more importantly, places
the burden on the user. Pull-based retrieval needs to be replaced with a
push-based paradigm for efficiency and for notification of relevant information
in a timely manner. WebVigiL is an efficient profile-based system to monitor,
retrieve, detect and notify specific changes to HTML and XML pages on the web.
In this paper, we describe the expressive profile specification language along
with its semantics. We also present an efficient implementation of these
profiles. Finally, we present the overall architecture of the WebVigiL system
and its implementation status.



1 Introduction 

Information on the Internet, growing at a rapid rate, is spread over multiple
repositories. This has greatly affected the way information is accessed,
delivered and disseminated. Users, at present, are not only interested in the
new information available on web pages but also in retrieving changes of
interest in a timely manner. More specifically, users may only be interested in
particular changes (such as keywords, phrases, links, etc.). Push and pull
paradigms [1] are traditionally used for monitoring the pages of interest. The
pull paradigm is an approach where the user periodically performs an explicit
action, in the form of a query or transaction execution, on the pages of
interest. Here, the burden of retrieving the required information is on the
user, and changes may be missed when a large number of web sites need to be
monitored. In the push paradigm, the system is responsible for accepting user
needs and informs the user (or a set of users) when something of interest
happens. Although this approach reduces the burden on the user, naive use of a
push paradigm results in informing users about changes to web pages
irrespective of the user’s interest. At present, most systems use a mailing
list to send the same compiled changes to all their subscribers.
Hence, an approach is needed which replaces periodic polling and notifies the
user of the relevant changes in a timely manner. The emphasis in WebVigiL is on
selective change notification. This entails notifying the user about the changes
to the web pages based on user-specified interest/policy. WebVigiL is a web
monitoring system which uses an appropriate combination of the push and
intelligent pull paradigms, with the help of active capability, to monitor
customized changes to HTML and XML pages. WebVigiL intelligently pulls the
information from the web server using a learning-based algorithm [2], based on
the user profile, and propagates/pushes only the relevant information to the end
user. In addition, WebVigiL is a scalable system, designed to detect even
composite changes for a large number of users. An overview of the paradigm used
and the basic approach taken for effective monitoring is discussed in [3].

This paper concentrates on the expressiveness of change specification, its seman- 
tics, and its implementation. In order for the user to specify notification and monitor- 
ing requirements, an expressive change specification language is needed. 

The remainder of the paper is organized as follows. Section 2 discusses related 
work. Section 3 discusses the syntax and semantics of the change specification lan- 
guage which captures the monitoring requirements of the user and in addition sup- 
ports inheritance, event-based duration and composite changes. Section 4 gives an 
overview of the current architecture and status of the system. Section 5 concludes the 
paper with an emphasis on future work. 

2 Related Work 

Many research groups have been working to address detecting changes to documents. 
GNU diff [4] detects changes between any two text files. Most of the previous
work in change detection has dealt only with flat files [5] and not structured
or unstructured web documents. Several tools have been developed to detect
changes between
two versions of unstructured HTML documents [6]. Some change-monitoring tools 
such as ChangeDetection.com [7] have been developed using the push-pull paradigm. 
But these tools detect changes to the entire page instead of user specified components 
and the changes can be tracked only on limited pages. 

2.1 Approaches for User Specification 

Present day users are interested in monitoring changes to pages and want to be noti- 
fied based on his/her profile. Hence, an expressive language is necessary to specify 
user-intent on fetching, monitoring and propagating changes. WebCQ [8] detects 
customized changes between two given HTML pages and provides an expressive 
language for the user to specify his/her interests. But WebCQ only supports changes 
between the last two pages of interest. As a result, flexible and expressive compare 
options are not provided to the user. AT&T Internet Difference Engine [9] views an 
HTML document as a sequence of sentences and sentence-breaking markups. This 
approach may be computationally expensive, as each sentence may need to be compared 
with all sentences in the document. WYSIGOT [10] is a commercial application 
that can be used to detect changes to HTML pages. It has to be installed on the local 
machine, which is not always possible. The system provides an interface for specifying 
the monitoring requirements for a web page. It can monitor an HTML page 
and also all the pages that it points to. But the granularity of change detection is at the 
page level. 

In [11], the authors allow the user to submit monitoring requests and continuous 
queries on the XML documents stored in the Xyleme repository. WebVigiL supports 
a life-span for each change monitoring request, which is akin to a continuous query. 
Change detection is continuously performed over the life-span. To the best of our 
knowledge, customized changes, inheritance, different reference selection or corre- 
lated specifications cannot be specified in Xyleme. 

3 Change Specification Language 

The present day web user’s interest has evolved from mere retrieval of information to 
monitoring the changes on web pages that are of interest. As the web pages are dis- 
tributed over large repositories, the emphasis is on selective and timely propagation 
of information/changes. Changes need to be notified to the user in different ways 
based on their profiles/policies. In addition, the notification of these changes may 
have to be sent to different devices that have different storage and communication 
bandwidths. The language for establishing the user policies should be able to accom- 
modate the requirements of a heterogeneous distributed large network-centric envi- 
ronment. Hence, there is a need to define an expressive and extensible specification 
language wherein the user can specify details such as the web page(s) to be moni- 
tored, the type of change (keywords, phrases etc.) and the interval for comparing 
occurrence of changes. The user should also be able to specify how, when, and where to 
be notified, taking into consideration quality of service factors such as timeliness 
and size vs. quality of notification. 

WebVigiL provides an expressive language with well-defined semantics for speci- 
fying the monitoring requirements of a user pertaining to the web [12]. Each monitor- 
ing request is termed a Sentinel. The change specification language developed for this 
purpose allows the user to create a monitoring request based on his/her requirements. 
The semantics of this language for WebVigiL have been formalized. The complete syntax 
of the language is shown in Fig. 1. 

Following are a few monitoring scenarios that can be represented using the above 
sentinel specification language. 

Example 1: Alex wants to monitor http://www.uta.edu/spring04/cse/classes.htm for 
the keyword “cse5331” in order to decide whether to register for the course cse5331. The 
sentinel runs from May 15, 2004 to August 10, 2004 (summer semester) and she 
wants to be notified as soon as a change is detected. The sentinel (s1) for this scenario is 
as follows: 

    Create Sentinel s1 Using http://www.uta.edu/spring04/cse/classes.htm 
        Monitor keyword (cse5331) 
        Fetch 1 day 
        From 05/15/04 To 08/10/04 
        Notify By email alex@aol.com Every best effort 
        Compare pairwise 
        ChangeHistory 3 
        Presentation dual frame 

Example 2: Alex wants to monitor the same URL as in Example 1 for regular updates 
on new courses being added but is not interested in changes to images. As it is 
correlated with sentinel s1, the duration is specified between the start of s1 and the end 
of s1. The sentinel (s2) for the above scenario is: 

    Create Sentinel s2 Using s1 
        Monitor Anychange AND (NOT) images 
        Presentation only change 



<Sentinel>          ::= Create Sentinel <sentinel-name> 
                        Using <sentinel-target> 
                        [Monitor <sentinel-type>] 
                        [Fetch <time interval> | on change] 
                        [From <time point> | <from event>] 
                        [To <time point> | <to event>] 
                        [Notify By <contact options>] 
                        [Every <time interval> | interactive | best effort | immediate] 
                        [Compare <compare options>] 
                        [ChangeHistory <n>] 
                        [Presentation <present options>] 
<sentinel-name>     ::= Identifier 
<sentinel-type>     ::= [<unary op>] <change type> [<binary op> <change type>] 
<change type>       ::= any change | all links | all images 
                        | all words [except {<word1>, ..., <wordn>}] 
                        | table : {<table id>} | list : {<list id>} 
                        | phrase : {<phrase1>[, <phrase2>, ..., <phrasen>]} 
                        | regular expression : {<exp>} 
                        | keywords : {<word1>[, <word2>, ..., <wordn>]} 
<sentinel-target>   ::= sentinel <sentinel name> | <url> 
<time interval>     ::= <integer> {second | minute | hour | day | week} 
<time point>        ::= <month>/<day>/<year> [+ <time interval>] 
                        | Now [+ <time interval>] 
<unary op>          ::= NOT 
<binary op>         ::= AND | OR 
<from event>        ::= start(<sentinel name>) [+ <time interval>] 
                        | during(<sentinel name>) 
                        | end(<sentinel name>) [+ <time interval>] 
<to event>          ::= start(<sentinel name>) [+ <time interval>] 
                        | end(<sentinel name>) [+ <time interval>] 
<contact options>   ::= email <email address> | fax <fax no> | PDA <details> 
<compare options>   ::= pairwise | moving <n> | every <n> 
<n>                 ::= integer 
<present options>   ::= only change | dual frame 

Fig. 1. Sentinel Syntax 



3.1 Sentinel Name 

This clause specifies a name for the user’s request. The syntax of sentinel name is Create 
Sentinel <sentinel-name>. For every sentinel, the WebVigiL system generates a 
unique identifier. In addition, the system allows the user to specify a sentinel 
name, which must be distinct across all of his/her sentinels. This name 
identifies a request uniquely. Further, it allows the user to specify another sentinel 
in terms of his/her previously defined sentinels. 

3.2 Sentinel Target 

The syntax of sentinel target is Using <sentinel-target>. The sentinel-target can be 
either a URL or a previously defined sentinel $S_i$. If the new sentinel $S_n$ specifies the 
sentinel target as $S_i$, then $S_n$ inherits all properties of $S_i$, unless the user overrides those 
properties in the current specification. 

In Example 1, Alex is interested in monitoring the course web page for the keyword 
‘cse5331’. Alex should be able to specify this URL as the target on which the 
system monitors changes to the keyword cse5331. Later, Alex wants to get updates 
on new classes being added to the page, as this may affect her decision about 
registering for the course cse5331. She should place another sentinel for the same 
URL but with different change criteria. As the second case is correlated with the first 
case, Alex can specify s1 as the sentinel target with a different change type. 

Sentinels are correlated if they inherit run time properties such as start and end 
time of a sentinel. Otherwise, they merely inherit static properties (e.g., URL, change 
type, etc. of the sentinel). The language allows the user to specify the reference web 
page or a previously placed sentinel as the target. 
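
The inheritance rule for sentinel targets can be illustrated with a minimal Python sketch. The `Sentinel` class, its field names and the `derive` helper are assumptions made for illustration only; they are not part of WebVigiL. The sketch copies every property of the parent sentinel and then applies the properties that the new specification overrides.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Sentinel:
    # Hypothetical field names that mirror the clauses of the language.
    name: str
    target: Optional[str] = None
    monitor: Optional[str] = None
    fetch: Optional[str] = None
    start: Optional[str] = None
    end: Optional[str] = None
    notify: Optional[str] = None
    compare: Optional[str] = None
    history: Optional[int] = None
    presentation: Optional[str] = None


def derive(parent: Sentinel, name: str, **overrides) -> Sentinel:
    """Create a sentinel that inherits every property of `parent`
    except those explicitly overridden in the new specification."""
    inherited = {f.name: getattr(parent, f.name) for f in fields(parent)}
    inherited.update(overrides)
    inherited["name"] = name
    return Sentinel(**inherited)


s1 = Sentinel(
    name="s1",
    target="http://www.uta.edu/spring04/cse/classes.htm",
    monitor="keyword (cse5331)",
    fetch="1 day",
    start="05/15/04",
    end="08/10/04",
    notify="email alex@aol.com",
    compare="pairwise",
    history=3,
    presentation="dual frame",
)

# Example 2: only the change type and the presentation are overridden.
s2 = derive(s1, "s2", monitor="Anychange AND (NOT) images",
            presentation="only change")
```

Here `s2` keeps the URL, fetch interval and lifespan of `s1`, which corresponds to the correlated case described above.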

3.3 Sentinel Type 

WebVigiL allows the detection of customized changes in the form of sentinel type 
and provides explicit semantics for the user to specify his/her desired type of change. 
The syntax of sentinel type is given as: Monitor <sentinel-type>, where 
<sentinel-type> ::= [<unary op>] <change type> [<binary op> <change type>] 

In Example 1, Alex is interested in ‘cse5331’. Detecting changes to the entire page 
leads to wasteful computations and further sends unwanted information to Alex. 

In Example 2, Alex is interested in any change to the class web page but is not in- 
terested in the changes pertaining to images. 

WebVigiL handles such requests by introducing change type and operators in its 
change specification language. The contents of a web page can be any combination of 
objects such as set of words, links and images. Users can specify such objects using 
change type and use operators over these objects. Change Specification Language 
defines Primitive change and Composite change for a sentinel type. 

Primitive change: It is the detection of a single type of change between two versions 
of the same page. For keyword change, the user must specify a set of words. An ex- 
ception list can also be given for any change. For phrase change, a set of phrases is 
specified. For regular expressions, a valid regular expression is given. 

Composite change: It comprises a combination of distinct primitive changes 
specified on the same page, using one of the binary operators AND and OR. The 
semantics of a composite change formed by the use of an operator are defined 
below (note that $\wedge$, $\vee$, and $\sim$ are the Boolean AND, OR, and NOT operators, 
respectively). 

3.4 Change Type 

If $V_1$ and $V_2$ are two different versions of the same page, then the change $C_t$ on $V_2$ with 
reference to $V_1$ is defined as: 

$$C_t(V_1, V_2) = \begin{cases} \text{True} & \text{if the change type } t \text{ is detected as an insert in } V_2 \text{ or a delete in } V_1 \\ \text{False} & \text{otherwise} \end{cases}$$
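
A minimal sketch of this definition, assuming that the objects of a given change type (for example keywords or links) have already been extracted from each version as a collection of hashable values. The function name `detect_change` is hypothetical.

```python
def detect_change(old_objects, new_objects):
    """Return (changed, inserted, deleted) for one change type.

    `old_objects` and `new_objects` are the objects of that type
    extracted from the reference version and the new version.
    """
    old, new = set(old_objects), set(new_objects)
    inserted = new - old   # present in the new version only
    deleted = old - new    # present in the reference version only
    return bool(inserted or deleted), inserted, deleted


changed, inserted, deleted = detect_change(
    ["cse5330", "cse5311"], ["cse5330", "cse5331"]
)
```

The Boolean result plays the role of the change value defined above, while the two sets record which objects caused it.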

The sentinel-type is the change type t selected from the set T where 
T = { any change, links, images, all words except <set of words>, phrase: <set of 
phrases>, keywords: <set of words>, table: <table id>, list: <list id>, regular 
expression: <exp> }. 

Based on the form of information that is usually available on web pages, change 
types are classified as links, images, keywords, phrases, all words, table, list, 
regular expression and any change. 

Links: Corresponds to a set of hypertext references. In HTML, links are presentation-based 
objects represented by the hypertext tag (<A href="...">). Given two versions 
of a page, if any of the old links are deleted in the new version or new links are 
inserted, a change is flagged. 

Images: Corresponds to a set of image references extracted from the image source. In 
HTML, images are represented by the image source tag (<IMG src="...">). The 
changes detected are similar to those for links, except that images are monitored. 
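
As an illustration of how link and image references might be extracted and compared, the following sketch uses Python's standard `html.parser` module. It is not the CH-Diff algorithm used by WebVigiL; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ReferenceExtractor(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = set()
        self.images = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.add(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.add(attributes["src"])


def extract_references(html):
    parser = ReferenceExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, parser.images


old_page = '<A href="a.html">A</A><IMG src="logo.png">'
new_page = '<A href="a.html">A</A><A href="b.html">B</A><IMG src="logo.png">'

old_links, old_images = extract_references(old_page)
new_links, new_images = extract_references(new_page)

link_change = old_links != new_links      # a link was inserted or deleted
image_change = old_images != new_images   # an image was inserted or deleted
```

In this example a link change is flagged because `b.html` was inserted, while the set of images is unchanged.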

Keywords <set of words>: Corresponds to a set of unique words from the page. A 
change is flagged when any of the keywords (mentioned in the set of words) appears 
or disappears in a page with respect to the previous version of the same page. 

Phrase <set of phrases>: Corresponds to a set of contiguous words from the page. A 
change is flagged on the appearance or disappearance of a given phrase in a page 
with respect to the previous version of the same page. An update to a phrase is also 
flagged, depending on the percentage of words that have been modified in the phrase. If 
the number of words changed exceeds a threshold, it is deemed a delete (or 
disappearance). 

Table: Corresponds to the content of the page represented in a tabular format. Though 
the table is a presentation object, the changes are tracked on the contents of the table. 
Hence, whenever the table contents are changed, it is flagged as a table change. 

List: Corresponds to the contents of a page represented in a list format. The list for- 
mat can be bullets or numbered. Any change detected on the set of words represented 
in a list format is flagged as a change. 

Regular expression <exp>: Expressed in valid regular expression syntax for querying 
and extracting specific information from the document data. 

All words: A page can be divided into a set of words, links and images. Any change 
to the set of words between two versions of the same page is detected as an all words 
change. 

All words encompasses phrases, keywords and words in tables and lists. While 
considering changes to all words, the presentation objects such as table and list are 
not considered; only the content in these presentation objects is taken into 
consideration. 

Anychange: Anychange encompasses all the above given types of changes. Changes 
to any of the defined set (i.e., all words, all links and all images) are flagged as any- 
change. Hence, the granularity is limited to a page for anychange. Any change is the 
superset of all changes. 

3.5 Operators 

Users may want to detect more than one type of change on a given page or the non- 
occurrence of a type of change. To facilitate such detections the change specification 
language includes unary and binary operators. 

NOT: A unary operator, which detects the non-occurrence of a change type. For a 
given change type $t$ on version $V_2$ with reference to version $V_1$ of the same page, the 
semantics of NOT are: $(\mathrm{NOT}\ C_t)(V_1, V_2) = \sim C_t(V_1, V_2)$ 

OR: A binary operator representing disjunction of change types. It is denoted by $C_{t_1}$ 
OR $C_{t_2}$ for two primitive changes $C_{t_1}$ and $C_{t_2}$ specified on version $V_2$ with reference to 
version $V_1$ of the same page. A change is detected if either $C_{t_1}$ is detected or $C_{t_2}$ is 
detected. Formally, $(C_{t_1}\ \mathrm{OR}\ C_{t_2})(V_1, V_2) = C_{t_1}(V_1, V_2) \vee C_{t_2}(V_1, V_2)$, where $t_1$, $t_2$ are 
the types of changes and $t_1 \neq t_2$. 

AND: A binary operator representing conjunction of change types. It is denoted by 
$C_{t_1}$ AND $C_{t_2}$ for two primitive changes $C_{t_1}$ and $C_{t_2}$ specified on version $V_2$ with 
reference to version $V_1$ of the same page. A change is detected when both $C_{t_1}$ and $C_{t_2}$ are 
detected. Formally, $(C_{t_1}\ \mathrm{AND}\ C_{t_2})(V_1, V_2) = C_{t_1}(V_1, V_2) \wedge C_{t_2}(V_1, V_2)$, where $t_1$, $t_2$ 
are the types of changes and $t_1 \neq t_2$. 

The unary operator NOT can be used to specify a constituent primitive change in a 
composite change. For example, for a page containing a list of fiction books, a user 
can specify a change type as: All words AND NOT phrase {“Lord of the Rings”}. Given 
two versions of the page, a change is flagged only if at least some words have 
changed (e.g., through the insertion of a new book and author) but the phrase “Lord of the 
Rings” has not changed. Hence, the user is interested in monitoring the arrival of new 
books or the removal of old books only as long as the book “Lord of the Rings” is 
available. 
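
The operator semantics can be sketched as a small recursive evaluator. The tuple encoding of expressions and the `evaluate` function are assumptions made for illustration; the Boolean value of each primitive change is taken as already computed for the pair of versions being compared.

```python
def evaluate(expr, detected):
    """Evaluate a sentinel-type expression.

    `expr` is either the name of a primitive change type (str) or a tuple
    ("NOT", e), ("AND", e1, e2) or ("OR", e1, e2).
    `detected` maps each primitive change type to its Boolean change value.
    """
    if isinstance(expr, str):
        return detected[expr]
    op = expr[0]
    if op == "NOT":
        return not evaluate(expr[1], detected)
    if op == "AND":
        return evaluate(expr[1], detected) and evaluate(expr[2], detected)
    if op == "OR":
        return evaluate(expr[1], detected) or evaluate(expr[2], detected)
    raise ValueError(f"unknown operator: {op!r}")


# All words AND NOT phrase {"Lord of the Rings"}
books = ("AND", "all words", ("NOT", "phrase"))

new_book_added = evaluate(books, {"all words": True, "phrase": False})
phrase_removed = evaluate(books, {"all words": True, "phrase": True})
```

With this encoding the book example flags a change when words change and the phrase does not, and stays silent once the phrase itself changes.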

3.6 Fetch 

Changes can be detected for a web page only when a new version of the page is 
fetched. New versions can be fetched based on the freshness of the page. The page 
properties (or meta-data) of a web page, such as the last modified date for static pages 
or the checksum for dynamic pages, determine whether a page has been modified. The 
syntax of fetch is Fetch <time interval> | on change. The user can specify a ‘time interval’ 
indicating how often a new page should be fetched, or can specify ‘on change’ to 
indicate that he/she is unaware of the change frequency of the page. 

On change: This option relieves the user of knowing when the page changes. 
WebVigiL’s fetch module uses a heuristic-based fetch algorithm called Best Effort 
Algorithm [13] to determine the interval with which a page should be fetched. This 
algorithm uses change history and meta-data of the page. 

Fixed Interval <time interval> $t_d$: The user can specify a fixed fetch interval 
at which a page is fetched by the system. $t_d$ can be expressed in minutes, hours, days or 
weeks (as a non-negative integer). 
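
The two ingredients of the fetch clause can be sketched as follows: converting a time interval into a duration, and using a checksum to decide whether a dynamic page has been modified. Both helper names are hypothetical, and SHA-256 is an assumed choice of checksum; the Best Effort Algorithm [13] itself is not reproduced here.

```python
import hashlib
from datetime import timedelta

# Maps the unit keywords of the grammar to timedelta argument names.
_UNITS = {"second": "seconds", "minute": "minutes", "hour": "hours",
          "day": "days", "week": "weeks"}


def parse_interval(text):
    """Turn a time interval such as '1 day' or '30 minutes' into a timedelta."""
    count, unit = text.split()
    unit = unit.rstrip("s")  # accept singular and plural unit names
    return timedelta(**{_UNITS[unit]: int(count)})


def page_modified(previous_checksum, body):
    """Checksum test for dynamic pages; returns (modified, new checksum)."""
    checksum = hashlib.sha256(body).hexdigest()
    return checksum != previous_checksum, checksum


interval = parse_interval("1 day")
modified, checksum = page_modified(None, b"<html>version 1</html>")
```

A page whose checksum equals the stored one need not be passed on to change detection.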

3.7 Sentinel Duration 

WebVigiL monitors a web page for changes during the lifespan of the sentinel. The 
lifespan of a sentinel is a closed interval formed by the start time and end time of 
sentinel. This is defined as: 

From <time point> | <from event> To <time point> | <to event> 

Let the timeline be an equidistant discrete time domain having “0” as the origin and 
each time point as a positive integer as defined in [14]. Defining it in terms of the 
timeline, occurrences of the created Sentinel S are specific points on the time line and 
the duration (lifespan) defines the closed interval within which S occurs. The ‘From’ 
modifier denotes the start of a sentinel S and the ‘To’ modifier denotes the end of S. 
The start and end times of a sentinel can be specific times or can depend upon the 
attributes of other correlated sentinels. The user has the flexibility to specify the dura- 
tion as one of the following: (a) Now (b) Absolute time (c) Relative time (d) Event- 
based time 

Now: A system-defined variable that keeps track of the current time. 

Absolute time: Denoted as time point T, it can be specified as a definite point on the 
time line. The format for specifying the time point is MM/DD/YYYY. 

Relative time: It is defined as an offset from a time point (either absolute or event- 
based). The offset can be specified by the time interval t d defined in Section 3.6. 

Event-based time: Events, such as the start and end of a sentinel, can be mapped to 
specific time points and can be used to trigger the start or end of a new sentinel. The start 
of a sentinel can also depend on the active state of another sentinel and is specified by 
the event ‘during’. During $s_i$ specifies that a sentinel should be started within the closed 
interval of $s_i$, with its start mapped to Now. When a sentinel inherits from 
another sentinel having a start time of Now, the start time of 
the current sentinel is mapped to the current time. 
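
A minimal sketch of how the four kinds of time specification might be resolved to concrete time points. The function `resolve_time_point`, the tuple encoding of events and the `lifespans` dictionary are assumptions made for illustration, not WebVigiL's implementation.

```python
from datetime import datetime, timedelta


def resolve_time_point(spec, now, lifespans, offset=timedelta(0)):
    """Map a From/To specification to a concrete datetime.

    spec      -- "Now", an absolute "MM/DD/YYYY" string, or a tuple
                 ("start" | "end" | "during", sentinel name)
    now       -- the current time, passed in to keep the function deterministic
    lifespans -- {sentinel name: (start, end)} for already resolved sentinels
    offset    -- optional relative offset (the '+ <time interval>' part)
    """
    if spec == "Now":
        base = now
    elif isinstance(spec, str):
        base = datetime.strptime(spec, "%m/%d/%Y")
    else:
        event, name = spec
        start, end = lifespans[name]
        if event == "start":
            base = start
        elif event == "end":
            base = end
        elif event == "during":
            if not start <= now <= end:
                raise ValueError(f"sentinel {name} is not active")
            base = now  # 'during' maps the start to Now
        else:
            raise ValueError(f"unknown event: {event!r}")
    return base + offset


lifespans = {"s1": (datetime(2004, 5, 15), datetime(2004, 8, 10))}
now = datetime(2004, 6, 1)

absolute = resolve_time_point("05/15/2004", now, lifespans)
relative = resolve_time_point(("end", "s1"), now, lifespans, timedelta(days=7))
during = resolve_time_point(("during", "s1"), now, lifespans)
```

Relative time is modelled here as an offset added to either an absolute or an event-based time point, following the description above.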

3.8 Notification 

Users need to be notified of detected changes. How, when and where to notify is an 
important criterion for notification and should be resolved by the change specification 
semantics. 

Notification Mechanism: The mechanism selected for notification is important, especially 
when multiple types of devices with varying capabilities are involved. The 
syntax for specifying the notification mechanism is given by: Notify By <contact 
options>. The <contact options> allow the users to select the appropriate mechanism 
for notification from a set of options O = {email, fax, PDA}. The default is 
email. 

Notification Frequency: The notification module has to ensure that the detected 
changes are presented to the user at the specified frequency. The system should 
give users the flexibility to specify the desired frequency of notification. 
The syntax of notification frequency is defined as: best effort | immediate | 
interactive | <time interval>, where <time interval> is as defined in Section 3.6. 
Immediate denotes immediate (without delay) notification on change detection. Best 
effort is defined as notifying as soon as possible after change detection. Hence, best 
effort is equivalent to immediate but has lower priority than immediate for 
notification. Interactive is a navigational-style notification approach where the user visits 
the WebVigiL dashboard to retrieve the detected changes at his/her convenience. 
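
The relative priority of immediate and best effort notifications can be sketched with a priority queue. This is only one possible reading of "lower priority"; the queue layout and the function names are assumptions, not the design of WebVigiL's notification module.

```python
import heapq
import itertools

# Lower number means higher priority (assumed encoding).
PRIORITY = {"immediate": 0, "best effort": 1}
_counter = itertools.count()  # tie-breaker that keeps FIFO order per priority


def enqueue(queue, frequency, message):
    heapq.heappush(queue, (PRIORITY[frequency], next(_counter), message))


def next_notification(queue):
    return heapq.heappop(queue)[2]


queue = []
enqueue(queue, "best effort", "change for s2")
enqueue(queue, "immediate", "change for s1")
enqueue(queue, "best effort", "change for s3")

delivery_order = [next_notification(queue) for _ in range(3)]
```

Immediate notifications leave the queue first, and best effort notifications follow in the order in which they were detected.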

3.9 Compare Options 

One of the unique aspects of WebVigiL is its compare option and its efficient 
implementation. Changes are detected between two versions of the same page. Each fetch 
of the same page is given a version number. The first version of the page is the 
first page fetched after a sentinel starts. Given a sequence of versions $V_1, V_2, \ldots, V_n$ 
of the same page, the user may be interested in knowing changes with respect to 
different references. In order to facilitate this, the change specification language allows 
users to specify three types of compare options. The syntax of compare options is: 
Compare <compare options>, where compare options can be selected from the set P = 
{pairwise, moving n, every n}. 

Pairwise: The default is pairwise, which will allow change comparison between two 
chronologically adjacent versions as shown in Fig 2. 

Every n: Consider a user, such as a web developer or administrator, who is aware of the 
changes occurring on a page and is interested only in the cumulative 
changes between every n versions. This compare option allows detecting changes 
between versions $V_i$ and $V_{i+n}$. For the next comparison, the nth page becomes the 
reference page. For example, if a user wants to detect changes between every 4 versions 
of the page, the versions for comparing will be selected as shown in Fig 2. 




Moving n: This is a moving window concept for tracking changes. It is useful when, for 
example, a user wants to monitor the trend of a particular stock, where meaningful change 
detection is only possible between a particular set of pages occurring in a moving window. 

If a user specifies the compare option moving n with n=4, as 
shown in Fig 2, $V_1$ will be the reference page for $V_4$. The next comparison will be 
between $V_2$ and $V_5$. 
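
The three compare options can be sketched as a function that lists which pairs of versions are compared. The function name is hypothetical. For moving n the sketch follows the example in the text (with n=4 the first version is the reference for the fourth). For every n it takes the stated pair (version i against version i+n) literally and makes the later version the next reference; this is an assumption, since the figure that fixes the exact selection is not reproduced here.

```python
def comparison_pairs(option, num_versions, n=None):
    """Return the (reference, current) version numbers, counted from 1."""
    if option == "pairwise":
        span, step = 1, 1       # chronologically adjacent versions
    elif option == "moving":
        span, step = n - 1, 1   # window of n versions that slides by one
    elif option == "every":
        span, step = n, n       # assumed: i vs i+n, then i+n is the reference
    else:
        raise ValueError(f"unknown compare option: {option!r}")
    return [(i, i + span) for i in range(1, num_versions - span + 1, step)]


pairwise = comparison_pairs("pairwise", 4)
moving_4 = comparison_pairs("moving", 6, n=4)
every_4 = comparison_pairs("every", 9, n=4)
```

Under these assumptions, `moving_4` contains the pairs (1, 4), (2, 5) and (3, 6), matching the moving window example in the text.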

WebVigiL believes in giving the users more flexibility and options for change de- 
tection and hence has incorporated several compare options for change specification 
along with efficient change detection algorithms. By default, the previous page (based 
on user-defined fetch interval where appropriate) and the current page are used for 
change detection. 

3.10 Change History 

The syntax of Change History is ChangeHistory <n>. The change specification language 
allows the user to specify the number of previous changes to be maintained by the 
system. The user should be able to view the last n changes detected for a particular request 
(sentinel). WebVigiL provides an interface for users to view and manage the sentinels 
they have placed. A user dashboard is provided for this purpose. The interactive option is 
a navigational-style notification approach where users visit the WebVigiL 
dashboard to retrieve the detected changes at their convenience. Through the 
WebVigiL dashboard, users can view and query the changes generated by their sentinels. 
The change history value specified by the user determines how many detected changes 
the system maintains. 
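
Keeping only the last n detected changes can be sketched with a bounded queue. The `ChangeLog` class is a hypothetical illustration; how WebVigiL actually stores change history is not described here.

```python
from collections import deque


class ChangeLog:
    """Keeps the last n changes detected for one sentinel."""

    def __init__(self, n):
        # A deque with maxlen drops the oldest entry automatically.
        self._changes = deque(maxlen=n)

    def record(self, change):
        self._changes.append(change)

    def last_changes(self):
        return list(self._changes)


log = ChangeLog(3)  # corresponds to ChangeHistory 3
for version in range(1, 6):
    log.record(f"change detected in version {version}")

history = log.last_changes()
```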

3.11 Presentation 

Presentation semantics are included in the language to present the detected changes to 
users in a meaningful manner. In Example 1, Alex is interested in viewing the content 
cse5331 along with its context, but in Example 2 she is interested in getting a brief 
overview of the changes occurring to the page. To support these needs, the change specification 
language provides users with two types of presentation. In the change-only 
approach, changes to the page, along with the type of change (insert/delete/update), 
are displayed in an HTML file using a tabular representation. The dual-frame 
approach shows both documents (involved in the change) on the same page in 
different frames side by side, highlighting the changes between the documents. The 
syntax is Presentation <presentation options>, where <presentation options> is 
specified as change-only | dual-frame. 
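
The two presentation styles can be approximated with Python's standard `difflib` module. This is a stand-in technique for illustration, not WebVigiL's presentation module, and it works on lines of text rather than on parsed HTML objects; the function names are hypothetical.

```python
import difflib


def dual_frame_table(old_lines, new_lines):
    """Side-by-side HTML table in which changed lines are highlighted."""
    return difflib.HtmlDiff().make_table(
        old_lines, new_lines,
        fromdesc="previous version", todesc="current version",
    )


def change_only_rows(old_lines, new_lines):
    """List of (type of change, text) pairs for inserted and deleted lines."""
    rows = []
    for line in difflib.ndiff(old_lines, new_lines):
        if line.startswith("+ "):
            rows.append(("insert", line[2:]))
        elif line.startswith("- "):
            rows.append(("delete", line[2:]))
    return rows


old_version = ["cse5330", "cse5311"]
new_version = ["cse5330", "cse5331"]

table_html = dual_frame_table(old_version, new_version)
rows = change_only_rows(old_version, new_version)
```

The first function corresponds loosely to the dual-frame idea of showing both versions with their context, the second to the change-only idea of listing just the changes and their type.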

3.12 Desiderata 

All of the above expressiveness is of little use if it is not implemented efficiently. 
One of the focuses of WebVigiL was to design efficient ways of supporting 
the sentinel specification, to provide a truly asynchronous way of notification, and to 
manage the sentinels using the active capability developed by the team earlier. In the 
following section, we describe the overall WebVigiL architecture and the current 
status of the working system. The reader is welcome to access the system at 
http://berlin.uta.edu:8081/webvigil/ and test the usage of the system. 

4 WebVigiL Architecture and Current Status 

WebVigiL is a profile-based change detection and notification system. The high-level 
block diagram shown in Fig 3 details the architecture of WebVigiL. WebVigiL aims 
at investigating the specification, management and propagation of changes as re- 
quested by the user in a timely manner while meeting the quality of service require- 
ments [15]. All the modules shown in the architecture (Fig 3) have been implemented. 




Fig. 3. WebVigiL Architecture 

The user specification module provides an interface for first-time users to register 
with the system and a dashboard for registered users to place, view, and manage their 
sentinels. A sentinel captures the user’s specification for monitoring a web page. The 
verification module is used to validate user-defined sentinels before sending the 
information to the Knowledgebase. The Knowledgebase is used to persist meta-data about 
each user and his/her sentinels. The change detection module is responsible for 
generating ECA rules [16] for the run time management of a validated sentinel. 

The fetch module is used to fetch pages for all active or enabled sentinels. Currently, the 
fetch module supports fixed interval and best effort approaches for fetching web 
pages. The version management module deals with a centralized server-based repository 
service that retrieves, archives, and manages versions of pages. A page is saved in the 
repository only if the latest copy in the repository is older than the fetched page. 
Subsequent requests for the web page can access the page from the cache instead of 
repeatedly invoking the fetch procedure. Reference [3] discusses how each URL is mapped to a 
unique directory and how all the versions of this URL are stored in this directory. 
Versions are checked for deletion periodically and versions no longer needed are 
deleted. 
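
The two repository rules described above can be sketched as follows. The hashing scheme used to derive a directory name is an assumption made for illustration; the actual mapping is the one discussed in [3]. Both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def url_directory(root, url):
    """Map a URL to one directory under `root` (assumed scheme: SHA-1 digest)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return Path(root) / digest


def should_store(latest_stored_time, fetched_time):
    """Store a fetched page only if the latest stored copy is older."""
    return latest_stored_time is None or latest_stored_time < fetched_time


directory = url_directory("repository", "http://www.uta.edu/spring04/cse/classes.htm")
```

Every version of the same URL lands in the same directory, and a page fetched with an unchanged timestamp is not stored again.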




Fig. 4. Presentation using the dual-frame approach for an HTML page 



The change detection module [17] builds a change detection graph to efficiently 
detect and propagate the changes. The graph captures the relationship between the 
pages and sentinels, and groups the sentinels based on the change type and target web 
page. Change detection is performed over the versions of the web page and the sentinels 
associated with the groups are informed about the detected changes. Currently, 
grouping is performed only for sentinels that follow the best effort approach for fetching 
pages. The WebVigiL architecture can support various page types such as XML, HTML, 
and TEXT in a uniform way. Currently, changes are detected between HTML pages 
using the specifically developed CH-Diff [2] module and between XML pages using the 
CX-Diff [18] module. Change detection modules for new page types can be added, or the 
current modules for HTML and XML page types can be replaced by more efficient modules, 
without disturbing the rest of the architecture. Among the change types discussed in 
Section 3.4, all except Table, List and Regular expression are currently 
supported by WebVigiL. 
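
The grouping step can be sketched with a dictionary keyed by target page and change type. This shows only the grouping idea, not the change detection graph of [17]; the function name and the tuple layout of the input are assumptions.

```python
from collections import defaultdict


def group_sentinels(sentinels):
    """Group sentinel names by (target URL, change type).

    `sentinels` is an iterable of (name, url, change_type) tuples.
    """
    groups = defaultdict(list)
    for name, url, change_type in sentinels:
        groups[(url, change_type)].append(name)
    return dict(groups)


groups = group_sentinels([
    ("s1", "http://www.uta.edu/spring04/cse/classes.htm", "keywords"),
    ("s2", "http://www.uta.edu/spring04/cse/classes.htm", "any change"),
    ("s3", "http://www.uta.edu/spring04/cse/classes.htm", "keywords"),
])
```

With such a grouping, a change of a given type on a page needs to be detected once and can then be reported to every sentinel in the group.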

Currently, the notification module propagates the detected changes to users via email. 
The presentation module supports both the change-only and dual-frame approaches for 
presenting the detected changes. A screenshot of the notification using the dual-frame 
approach for HTML pages is shown in Fig 4. This approach is visually intuitive and 
enhances user interpretation since changes are presented along with the context. 

5 Conclusion and Future Work 

In this paper we have discussed the rationale for an expressive change specification 
language, its syntax as well as its semantics. We have given a brief overview of the 
WebVigiL architecture and have discussed the current status of the system, which 
includes a complete implementation of the language presented. We are currently 
working on several extensions. The change specification language can be extended to 
provide the capability of supporting sentinels on multiple URLs. The current fetch 
module is being extended to a distributed fetch module to reduce the network traffic. 
The deletion algorithm for the cached versions discussed in Section 4 is being 
improved to delete pages that are no longer needed as soon as possible, instead of 
using the slightly conservative approach employed currently. 

Abstract. In this paper we present an approach for the interactive refinement of 
ontology-based queries. The approach is based on generating a lattice of 
refinements that enables a step-by-step tailoring of a query to the current 
information needs of a user. These needs are implicitly elicited by analysing the 
user’s behaviour during the searching process. The gap between a user’s need 
and his query is quantified by measuring several types of query ambiguities, 
which are used for ranking the refinements. The main advantage of the 
approach is more cooperative support in the refinement process: by exploiting 
the ontology background, the approach supports finding “similar” results and 
enables efficient relaxing of failing queries. 



1 Introduction 

Although a lot of research has been dedicated to improving the cooperativeness of the 
information access process [1], almost all of it has focused on resolving the problem 
of an empty answer set. Indeed, whether a query fails due to false presuppositions 
concerning the content of the knowledge base, which lead to stonewalling behaviour of the 
retrieval system, or due to misconceptions (concerning the schema of the domain), 
which cause mismatches between a user’s view of the world and the concrete 
conceptualisation of the domain, it is more cooperative to identify the 
cause of failure than just to report the empty answer set. If there is no cause 
per se for the query’s failure, it is then worthwhile to report the part of the query which 
failed. Further, some types of query generalizations [2] or relaxations [3], [4] have been 
proposed for weakening a user’s query in order to allow him to find some relevant 
results. 

The growing nature of web information content implies a pattern of user behaviour 
that should be treated in a more collaborative way in modern retrieval 
systems: users tend to make short queries which they subsequently refine (expand). 
Indeed, in order to be sure of getting some answer to a query, a user forms a query that is 
as short as possible and, depending on the list of answers, tries to narrow the query in 
several refinement steps. Probably the most expressive examples are product catalogue 
applications that serve as web interfaces to large product databases. The main 
problem here is that a user cannot clearly express his need for a product by using only 
2-3 terms, i.e. a user’s query represents just an approximation of his information need 
[5]. Therefore, a user tries in several refinement steps to filter the list of retrieved 
products, so that only the products which are most relevant for his information need 
remain. Unfortunately, most retrieval systems do not provide cooperative 
support in the query refinement process, so that a user is “forced” to change his query 
on his own in order to find the most suitable results. Indeed, although in an interactive 
query refinement process [6] a user is provided with a list of terms that appear 
frequently in retrieved documents, a more semantic analysis of the relationships between 
these terms is missing. For example, if a user made a query for a metallic car, then the 
refinements that include the value of the car’s colour can be treated as more 
relevant than the refinements regarding the speed of the car, since the feature metallic 
is strongly related to the colour. At the least, such reasoning can be expected from a 
human shop assistant. Obviously, if a retrieval system has more information about the 
model of the underlying product data, then a more cooperative (human-like) retrieval 
process can be created. 

In our previous work we have developed a query refinement process, called Librar- 
ian Agent Query Refinement process, that uses an ontology for modelling an informa- 
tion repository [7], [8]. That process is based on incrementally and interactively tailor- 
ing a query to the current information need of a user, whereas that need is discovered 
implicitly by analysing the user’s behaviour during the search process. The gap be- 
tween the user’s query and his information need is defined as the query ambiguity and 
it is measured by several ambiguity parameters that take into account the used ontol- 
ogy as well as the content of the underlying information repository. In order to pro- 
vide a user with suitable candidates for the refinement of his query, we calculate the 
so-called Neighbourhood of that query. It contains the query’s direct neighbours re- 
garding the lattice of queries defined by considering the inclusion relation between 
query results. 

In this paper we extend this work by involving more user-related information in
the refinement phase of the query refinement process. In that way our approach
ensures continual adaptation of the retrieval system to the changing preferences of
users. Due to the reluctance of users to give explicit information about the quality of
the retrieval process, we base our work on implicit user feedback, a very popular
information retrieval technique for gathering users’ preferences [9]. From a user’s
point of view, our approach offers several benefits. First, it provides a more cooperative
retrieval process: in each refinement step a user is provided with a complete but
minimal set of refinements, which enables him to develop and express his information
need in a step-by-step fashion. Secondly, although all users’ interactions are anonymous,
we personalize the searching process and achieve so-called ephemeral personalization
by implicitly discovering a user’s need. The next benefit is the possibility to anticipate
which alternative resources can be interesting for the user. Finally, this principle
enables coping with a user’s requests that cannot be fulfilled in the given repository
(i.e. requests that return zero results), a hard problem for existing information
retrieval approaches.

The paper is organised as follows: In the second Section we present the extended 
Librarian Agent Query Refinement process and discuss its cooperative nature. Sec- 
tion 3 provides related work and Section 4 contains concluding remarks. 



2 Librarian Agent Query Refinement Process 

The goal of the Librarian Agent Query Refinement process [8] is to enable a user to 
efficiently find results relevant for his information need in an ontology-based information
repository, even if some problems we sketched in the previous section appear
in the searching process. These problems lead to some misinterpretations of a user’s 
need in his query, so that either a lot of irrelevant results and/or only a few relevant 
results are retrieved. In the Librarian Agent Query Refinement process, potential 
ambiguities (i.e. misinterpretations) of the initial query are firstly discovered and 
assessed (cf. the so-called Ambiguity-Discovery phase). Next, these ambiguities are 
interpreted regarding the user’s information need, in order to estimate the effects of an 
ambiguity on the fulfilment of the user’s goals (cf. the so-called Ambiguity- 
Interpretation phase). Finally, the recommendations for refining the given query are 
ranked according to their relevance for fulfilling the user’s information need and ac- 
cording to the possibility to disambiguate the meaning of the query (cf. the so-called 
Query Refinement phase). In that way, the user is provided with a list of relevant 
query refinements ordered according to their capabilities to decrease the number of 
irrelevant results or/and to increase the number of relevant results. In the next three 
subsections we explain these three phases further; the first phase is only
sketched here, since its complete description is given in [8].

In order to present the approach in a more illustrative way, we refer to examples 
based on the ontology presented in Fig. 1. Table 1 represents a small product catalog, 
indexed/annotated with this ontology. Each row represents the features assigned to a 
product (a car), e.g. product P8 is a cabriolet, its colour is green metallic and it has an 
automatic gear changing system. The features are organised in an isA hierarchy, for 
example the feature (concept) “BlueColor” has two specializations “DarkBlue” and 
“WhiteBlue” which means that a dark or white blue car is also a blue colour car. 



isA(SportsCar, Car), isA(FamilyCar, Car)
isA(MiniCar, Car), hasFeature(Car, Feature)
isA(Colour, Feature)*, isA(Luxury, Feature)
isA(GearChanging, Feature), isA(Petrol, Feature)
isA(MerchantFeature, Feature), isA(Price, MerchantFeature)
hasColourValue(Colour, ColourValue)
sub(ColourValue, ColourValue)
hasColourType(Colour, ColourType)
hasLuxuryType(Luxury, LuxuryType)
hasGearChangingType(GearChanging, GearChangingType)
hasPetrolType(Petrol, PetrolType)
spendLitters(Petrol, Value)
hasPriceValue(Price, Value)

ColourValue(“BlueColor”), ColourValue(“DarkBlue”)
ColourValue(“WhiteBlue”), ColourValue(“GreenColor”)
sub(“BlueColor”, “DarkBlue”), sub(“BlueColor”, “WhiteBlue”)
ColourType(“Metallic”), ColourType(“Standard”)
ColourType(“Protected”), LuxuryType(“Cabriolet”)
GearChangingType(“Automatic”), PetrolType(“Diesel”)

* This means that Colour is a type of Feature.

Fig. 1. The car-feature ontology used throughout the paper



2.1 Phase 1: Ambiguity Discovery 

We define query ambiguity as an indicator of the gap between the user’s information
need and the query that results from that need. We have found two main factors
that cause the ambiguity of a query: the vocabulary (ontology) and the information
repository. Accordingly, we define two types of ambiguity that can arise in interpreting
a query: (i) the semantic ambiguity, as a characteristic of the used ontology, and (ii) the
content-related ambiguity, as a characteristic of the repository. In the next two
subsections we give more details on them.

Table 1. A product catalog based on Fig. 1



2.1.1 Semantic Ambiguity 

The goal of an ontology-based query is to retrieve the set of all instances which fulfil 
all constraints given in that query. In such a logic query the constraints are applied on 
the query variables. For example in the query: 

    ∀x ← Colour(x) and hasColorValue(x, BlueColour)

x is a query variable and hasColorValue(x, BlueColour) is a query constraint. The
stronger these constraints are (by assuming that all of them correspond to the user’s 
need), the more relevant the retrieved instances are for the user’s information need. 

Since an instance in an ontology is described through (i) the concept it belongs to 
and (ii) the relations to other instances, we see two factors which determine the se- 
mantic ambiguity of a query variable: 

— the concept hierarchy: How general is the concept the variable belongs to 

— the relation-instantiation: How descriptive/strong are constraints applied to that 
variable 

Consequently, we define the following two parameters in order to estimate these 
values: 



Definition 1: Variable Generality

$$\mathrm{VariableGenerality}(X) = \mathrm{Subconcepts}(\mathrm{Type}(X)) + 1,$$

where Type(X) is the concept the variable X belongs to and Subconcepts(C) is the
number of subconcepts of the concept C.

Definition 2: Variable Ambiguity

$$\mathrm{VariableAmbiguity}(X,Q) = \frac{|\mathrm{Relation}(\mathrm{Type}(X))| + 1}{|\mathrm{AssignedRelations}(\mathrm{Type}(X),Q)| + 1} \cdot \frac{1}{|\mathrm{AssignedConstraints}(X,Q)| - |\mathrm{AssignedRelations}(\mathrm{Type}(X),Q)| + 1} \tag{1}$$

where Relation(C) is the set of all relations defined for the concept C in the ontology,
AssignedRelations(C,Q) is the set of all relations defined in the set Relation(C)
which appear in the query Q, and AssignedConstraints(X,Q) is the set of all constraints
related to the variable X that appear in the query Q.

The total ambiguity of a variable is calculated as the product of these two parame- 
ters, in order to model uniformly the directly proportional effect of both parameters to 
the ambiguity. Note that the second parameter is typically less than 1. We now define 
the ambiguity as follows: 



$$\mathrm{Ambiguity}(X,Q) = \mathrm{VariableGenerality}(X) \cdot \mathrm{VariableAmbiguity}(X,Q) \tag{2}$$

Finally, the Semantic Ambiguity for the query Q is calculated as follows:

$$\mathrm{SemanticAmbiguity}(Q) = \sum_{x \in Var(Q)} \mathrm{Ambiguity}(x,Q)$$

where Var(Q) represents the set of variables that appear in the query Q. By analysing 
these ambiguity parameters it is possible to discover which of the query variables 
introduces the highest ambiguity in a query. Consequently, this variable should be 
refined in the query refinement phase. 
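
The following minimal Python sketch illustrates how these parameters combine. It reads Definition 2 as the product of two ratios and assumes that a variable never has fewer assigned constraints than assigned relations. The class name `VariableStats`, its fields and the counts used below are illustrative assumptions; the counts are read off Fig. 1 for a variable x of type Colour (constrained by one colour-value relation) and an unconstrained variable y of type Car.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableStats:
    """Counts for one query variable X in a query Q (hypothetical container)."""
    subconcepts: int           # number of subconcepts of Type(X)
    relations: int             # |Relation(Type(X))|
    assigned_relations: int    # |AssignedRelations(Type(X), Q)|
    assigned_constraints: int  # |AssignedConstraints(X, Q)|


def variable_generality(v: VariableStats) -> int:
    # Definition 1: number of subconcepts plus one.
    return v.subconcepts + 1


def variable_ambiguity(v: VariableStats) -> float:
    # Definition 2, read as the product of two ratios.
    unused_relations = (v.relations + 1) / (v.assigned_relations + 1)
    extra_constraints = v.assigned_constraints - v.assigned_relations + 1
    return unused_relations / extra_constraints


def ambiguity(v: VariableStats) -> float:
    # Formula (2): product of generality and variable ambiguity.
    return variable_generality(v) * variable_ambiguity(v)


def semantic_ambiguity(variables) -> float:
    # Sum of the ambiguities of all variables in the query.
    return sum(ambiguity(v) for v in variables)


# x: Colour has no subconcepts and two relations in Fig. 1, one of them used.
x = VariableStats(subconcepts=0, relations=2, assigned_relations=1,
                  assigned_constraints=1)
# y: Car has three subconcepts and one relation (hasFeature), none used.
y = VariableStats(subconcepts=3, relations=1, assigned_relations=0,
                  assigned_constraints=0)

print(ambiguity(x), ambiguity(y), semantic_ambiguity([x, y]))
```

In this toy setting the unconstrained Car variable contributes most of the ambiguity, so it would be the first candidate for refinement.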

2.1.2 Content-Related Ambiguity 

An ontology defines just a model of how the entities from a real domain should be
structured. If there is a part of that model that is not instantiated in the given domain, then
that part of the model cannot be used for calculating ambiguity. Therefore, we should 
use the content of the information repository to prune the results from the ontology- 
related analyses of a user’s query. 

2.1.2.1 Query Neighbourhood 

We introduce here the notation for ontology-related entities that are used in the rest of 
this subsection: 

— Q(O) is a query defined against the ontology O. The setting in this paper encompasses
positive conjunctive queries. However, the approach can be easily extended
to queries that include negation and disjunction.

— Ω(O) is the set of all possible elementary queries (queries that contain only one
constraint) for an ontology O.

— KB(O) is the set of all relation instances (facts) which can be proven in the given
ontology O. It is called the knowledge base.

— A(Q(O)) is the set of answers (in the logical sense) for the query Q regarding the
ontology O.

Definition 3: Ontology-based information repository 

An ontology-based information repository IR is the structure (R, O, ann), where:

— R is a set of elements $r_i$ that are called resources, $R = \{ r_i \}$, $1 \le i \le n$;

— O is an ontology, which defines the vocabulary used for annotating these resources.
We say that the repository is annotated with the ontology O and a knowledge
base KB(O);

— ann is a binary relation between the set of resources and the set of facts from the
knowledge base KB(O), $ann \subseteq R \times KB(O)$. We write ann(r, k), meaning that a fact
k is assigned to the resource r (i.e. the resource r is annotated with the fact k).

Definition 4: Resources-Attributes group (user’s request)

A Resources-Attributes group q in an IR = (R, O, ann) is a tuple $q = (Q', R')$, where

— $Q' \subseteq \Omega(O)$ is called the set of q_attributes,

— $R' \subseteq R$ is called the set of q_resources. It follows that $R' = \{ r \in R \mid \forall i \in A(Q') : ann(r,i) \}$, i.e.
this is the set of resources which are annotated with all attributes of the query Q'.

Definition 5: Structural equivalence (=) between two user’s requests $q_1$, $q_2$ is
defined by: $(Q_1', R_1') = (Q_2', R_2') \leftrightarrow R_1' = R_2'$. It means that two user’s requests are
structurally equivalent if their sets of result resources (q_resources) are equal.

Definition 6: Cluster of users’ queries (in the rest of the text: Query cluster) is a set of
all structurally equivalent user’s requests, $\Delta = (Q_x, R_y)$, where

— $Q_x \subseteq \Omega(O)$; $Q_x$ is called the set of Δ_attributes (attribute set) and contains the
union of the attributes of all requests that are equivalent. For a user’s request $q_u$ it is
calculated in the following manner:

$$Q_x = \bigcup_{\forall q_i = q_u} q_i\text{\_attributes}.$$

It holds: $\forall q_1, q_2:\ q_1 = q_2 \wedge (q_1\text{\_attributes} \subseteq Q_x) \rightarrow q_2\text{\_attributes} \subseteq Q_x$.

— $R_y \subseteq R$; $R_y$ is called the set of Δ_resources (resource set) and is equal to the
resource set of the query $Q_x$. Formally: $R_y = \{ r \in R \mid \forall i \in A(Q_x) \rightarrow ann(r,i) \}$.

The Query cluster which contains all existing resources in IR (i.e. a cluster for
which $R_y = R$) is called the root cluster. The set of all Query clusters Δ is denoted by
Δ(IR).

Definition 7: Structural subsumption (parent-child relation) (<) between two Query
clusters is defined by:

$$(Q_{x1}, R_{y1}) < (Q_{x2}, R_{y2}) \leftrightarrow R_{y1} \subset R_{y2} \quad \text{or} \quad (Q_{x1}, R_{y1}) < (Q_{x2}, R_{y2}) \leftrightarrow Q_{x1} \supset Q_{x2}.$$

A Query cluster $\Delta_2$ subsumes another cluster $\Delta_1$ if the set of resources of $\Delta_2$ is a
superset of the resource set of the cluster $\Delta_1$, or if the set of attributes of $\Delta_2$ is a subset
of the set of attributes of $\Delta_1$. Note that this relation is irreflexive, anti-symmetric and
transitive.

We define a special subsumption relation ($<_{dir}$) on the set of Query clusters:

$$\Delta_1 <_{dir} \Delta_2 \leftrightarrow \Delta_1 < \Delta_2 \wedge \neg\exists \Delta_i:\ \Delta_1 < \Delta_i < \Delta_2.$$

In that case we call $\Delta_2$ a direct_parent cluster of $\Delta_1$ and $\Delta_1$ a direct_child cluster of
$\Delta_2$. The sets of direct_parent and direct_child clusters of a Query cluster $\Delta_1$ are
calculated in the following way:

$$\mathrm{DirectParents}(\Delta_1) = \{ \Delta_i \in \Delta(IR) \mid R_{y1} \subset R_{yi} \wedge \neg\exists \Delta_j:\ R_{y1} \subset R_{yj} \subset R_{yi} \wedge \exists q \in Q_{xi} \wedge q \notin Q_{rest-i} \},$$

where $Q_{rest-i}$ is the set of all query terms that belong to direct_parent clusters, excluding
the cluster $\Delta_i$, and

$$\mathrm{DirectChildren}(\Delta_1) = \{ \Delta_i \in \Delta(IR) \mid Q_{x1} \subset Q_{xi} \wedge \neg\exists \Delta_j:\ Q_{x1} \subset Q_{xj} \subset Q_{xi} \wedge \exists r \in R_{yi} \wedge r \notin R_{rest-i} \},$$

where $R_{rest-i}$ is the set of all resources that belong to direct_child clusters, excluding
the child $\Delta_i$.

The conditions regarding the rest-i sets in the above-mentioned definitions ensure the
minimal cardinality of both direct neighbour sets.

Moreover, the partial order of Query clusters induces many properties that are
important for the efficient manipulation of user’s queries:

Definition 8: Structural similarity (siblings relation) (~)

We define two kinds of structural similarity between two Query clusters $\Delta_1$ and $\Delta_2$:

— Common_parent ($\sim_{parent}$): $\Delta_1 \sim_{parent} \Delta_2 \leftrightarrow \exists \Delta_i:\ \Delta_1 <_{dir} \Delta_i \wedge \Delta_2 <_{dir} \Delta_i \wedge \neg(\Delta_1 = \Delta_2)$

— Common_child ($\sim_{child}$): $\Delta_1 \sim_{child} \Delta_2 \leftrightarrow \exists \Delta_i:\ \Delta_i <_{dir} \Delta_1 \wedge \Delta_i <_{dir} \Delta_2 \wedge \neg(\Delta_1 = \Delta_2)$

Definition 9: Disjoint Query clusters (<>) are clusters which have no resources in
common, i.e. $\Delta_1 <> \Delta_2 \leftrightarrow R_{y1} \cap R_{y2} = \{\}$.

Definition 10: Query Neighbourhood

The Neighbourhood of a user’s request $q_u$ is the structure $N := (E, P, C, \gamma, \eta)$, where

— $E := \{ q \mid q = q_u \}$, i.e. the set of all user’s requests equivalent to $q_u$. Consequently,
all requests from E form the starting cluster $\Delta_{start}(q_u)$ for $q_u$, i.e. $E = \Delta_{start}(q_u)$.

— $P := \{ \Delta_i \mid \Delta_{start}(q_u) <_{dir} \Delta_i \}$, i.e. the set of all direct_parent clusters of $\Delta_{start}(q_u)$.

— $C := \{ \Delta_i \mid \Delta_i <_{dir} \Delta_{start}(q_u) \}$, i.e. the set of all direct_child clusters of $\Delta_{start}(q_u)$.

— $\gamma: P \rightarrow \Re$ is the relevance function for direct_parent clusters (it is used for ranking
direct_parent clusters) and

— $\eta: C \rightarrow \Re$ is the relevance function for direct_child clusters (it is used for ranking
direct_child clusters) ($\Re$ denotes the set of real numbers).

The ranking of direct_child (direct_parent) clusters is discussed in section 2.3.1. 
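
As an illustration of the definitions in this subsection, the sketch below builds Query clusters for a tiny repository and derives the direct parents and direct children of a starting cluster. It is a simplified sketch under stated assumptions: the catalog `CATALOG` is hypothetical (it is not Table 1), annotations are reduced to plain attribute names, clusters are enumerated by brute force, and the additional rest-i conditions of DirectParents and DirectChildren are not modelled.

```python
from itertools import combinations

# Hypothetical toy repository: resource -> attributes it is annotated with.
CATALOG = {
    "P1": {"Cabriolet", "Metallic", "Automatic"},
    "P2": {"Cabriolet", "Metallic"},
    "P3": {"Metallic", "Diesel"},
    "P4": {"Automatic", "Diesel"},
}


def resources_for(attrs, catalog):
    """Resources annotated with every attribute in attrs (the q_resources)."""
    return frozenset(r for r, a in catalog.items() if attrs <= a)


def cluster_of(attrs, catalog):
    """Query cluster of a request: (attribute set, resource set)."""
    res = resources_for(frozenset(attrs), catalog)
    if res:
        # Union of the attributes of all equivalent requests equals the
        # attributes shared by every resource in the result set.
        closed = frozenset.intersection(*(frozenset(catalog[r]) for r in res))
    else:
        closed = frozenset().union(*catalog.values())
    return closed, res


def all_clusters(catalog):
    """Enumerate every cluster by closing each attribute subset (brute force)."""
    universe = sorted(set().union(*catalog.values()))
    found = set()
    for size in range(len(universe) + 1):
        for combo in combinations(universe, size):
            found.add(cluster_of(combo, catalog))
    return found


def direct_parents(cluster, clusters):
    """Clusters with a strictly larger resource set and nothing in between."""
    res = cluster[1]
    bigger = [c for c in clusters if res < c[1]]
    return {c for c in bigger if not any(res < o[1] < c[1] for o in bigger)}


def direct_children(cluster, clusters):
    """Clusters with a strictly smaller resource set and nothing in between."""
    res = cluster[1]
    smaller = [c for c in clusters if c[1] < res]
    return {c for c in smaller if not any(c[1] < o[1] < res for o in smaller)}


clusters = all_clusters(CATALOG)
start = cluster_of({"Cabriolet"}, CATALOG)
print("start   :", sorted(start[0]), sorted(start[1]))
for parent in direct_parents(start, clusters):
    print("parent  :", sorted(parent[0]), sorted(parent[1]))
for child in direct_children(start, clusters):
    print("child   :", sorted(child[0]), sorted(child[1]))
```

The parents correspond to generalisations of the request and the children to its refinements, which is the neighbourhood a user would be offered for navigation.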

2.1.2.2 Quantifying Content-Related Ambiguity 

We define several properties which characterise the content ambiguity of a query [7], 
but due to the lack of space we focus here only on two which are used in the subse- 
quent sections. 

Max_equal_request of a user’s request $q_a$ is the set of attributes found in its largest
(regarding q_attributes) structurally equivalent request, $q_{amax}$. It corresponds to the
starting cluster $\Delta_{start}(q_a)$ (see Definition 10), or $\Delta_a$ as a shorthand here. This cluster is
calculated as $\Delta_a = (Q_{xa}, R_{ya})$ so that $Q_{xa} \supseteq Q_a' \wedge \neg\exists \Delta_i,\ \Delta_a <_{dir} \Delta_i,\ Q_{xi} \supseteq Q_a'$. Note that $Q_a'$ is
the set of attributes in the query $q_a$.

Min_equal_request is the set of attributes found in the smallest (regarding
q_attributes) equivalent request for the given user’s request. There can be
several such groups. They are calculated in the following way:

WQti^Qa' ))\ A fl < dir A,.,i = l, ..n } , where A„ =A start (q a ). 
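
The two properties can be illustrated by a brute-force Python sketch that works directly from their verbal definitions rather than from the cluster formulas. The catalog below is hypothetical and chosen so that Cabriolet and Metallic always occur together; the function names are illustrative, and the search for Min_equal_requests is restricted to subsets of the given query, which is an assumption of this sketch.

```python
from itertools import combinations

# Hypothetical toy repository: resource -> attributes it is annotated with.
CATALOG = {
    "P1": {"Cabriolet", "Metallic", "Automatic"},
    "P2": {"Cabriolet", "Metallic"},
    "P3": {"Diesel", "Automatic"},
    "P4": {"Diesel"},
}


def results(attrs, catalog=CATALOG):
    """Resources annotated with every attribute of the request."""
    return frozenset(r for r, a in catalog.items() if set(attrs) <= a)


def max_equal_request(query, catalog=CATALOG):
    """Largest attribute set that retrieves exactly the same resources."""
    res = results(query, catalog)
    if not res:
        return frozenset(query)  # failing query: nothing to compare against
    return frozenset.intersection(*(frozenset(catalog[r]) for r in res))


def min_equal_requests(query, catalog=CATALOG):
    """Minimal subsets of the query that retrieve the same resources."""
    target = results(query, catalog)
    minimal = []
    for size in range(len(query) + 1):
        for combo in combinations(sorted(query), size):
            candidate = frozenset(combo)
            if results(candidate, catalog) != target:
                continue
            # Keep the candidate only if no smaller equivalent subset exists.
            if not any(m <= candidate for m in minimal):
                minimal.append(candidate)
    return minimal


print(sorted(max_equal_request({"Cabriolet"})))
print([sorted(m) for m in min_equal_requests({"Cabriolet", "Metallic"})])
```

Here the request for Cabriolet is silently equivalent to the request for Cabriolet and Metallic, while the two-term request can be reduced to either of its terms without changing the results.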

2.2 Phase 2: Interpretation of Query Ambiguities 

The previously defined parameters estimate the ambiguity of a query regarding the 
underlying ontology and the information repository. However, the problems in the 
meaning of a query have to be analysed regarding the user’s needs, i.e. regarding the 
resource(s) the user is searching for. 

2.2.1 Importance of a Query Constraint for a User’s Need

Despite the fact that in most IR systems users do not quantify the importance of a
constraint (attribute) in a query, a user does have his local preferences regarding these
constraints, i.e. some constraints in a user’s request are more relevant for his need
than others. However, the crucial problem is how to discover the differences in a user’s
preferences regarding a request without forcing the user to give explicit feedback
about those preferences. In our approach we rely on the so-called implicit relevance 
feedback, a very popular information retrieval technique for gathering user’s prefer- 
ences [9]. However, we expand and extend these methods according to the different 
nature of our ontology-based querying process. We define two parameters that de- 
scribe the importance of a constraint for the user’s need: Interpretability and Actual- 
ity. 

Interpretability. The general assumption is that a user forms a query according to his
current information need, i.e. that all parts of the query correspond, to some extent, to
his need. However, it is possible that a query is interpreted in a repository in an
unexpected manner, i.e. the execution of a query can be interpreted ambiguously: either
as a more general or as a more specific query. For example, if a user is searching for a
Cabriolet car against the given product catalog (Table 1), then, due to the content of the
underlying dataset, the system will retrieve the same results as for the query
?“Cabriolet, Metallic”, which is the Max_equal_request (see Section 2.1.2.2) for the
request ?“Cabriolet”. It means that the constraint Metallic is interpreted as an additional
constraint in the user’s query, but it might not be aligned with his need. Consequently,
the importance of that constraint for refining should be reduced. Therefore, if a
constraint is a part of the Max_equal_request of a user’s query, but not of that query, then
its Interpretability is very low (= 0). Conversely, if a user is searching for a Cabriolet
and Metallic car, the system will retrieve the same results as for the query ?“Cabriolet”
or for the query ?“Metallic”, which can be interpreted as the “ignorance” of one
of these constraints in the searching process. Consequently, the importance of these
constraints should be increased. Since the reduced queries belong to the
Min_equal_requests of the initial query, by considering the constraints from the initial
query that are not contained in the Min_equal_requests we get the set of constraints
with a high Interpretability (= 1). Therefore:



where c is a constraint from a user’s query $q_a$.

Actuality. The Actuality parameter reflects the phenomenon, reported in IR
research [10], that a user may change the criteria for the relevance of a query term
when encountering newly retrieved results. In other words, the constraints most
recently introduced in a user’s query are more indicative of what the user currently
finds relevant for his need. We model it using the analogy to the ostensive relevance
[10]:



$$\mathrm{Actuality}(c, Qs) = \frac{1}{\mathrm{num\_session}(c, Qs) + 1}$$

where c is a constraint (Δ_attribute), Qs is the current query session and
num_session(c, Qs) is the number of refinement steps in which the constraint c is
involved.

2.2.2 Relevance of a Query Constraint for a User’s Need

The theory of implicit relevance feedback postulates that if a user selects a
resource from the list of retrieved results, then this resource corresponds, to some
extent, to his information need. However, a click on a particular resource in the
result list cannot be treated as an absolute relevance judgement, since users typically
scan only the top l ranked (l ≈ 10) resources. For example, a document ranked
much lower in the list may have been much more relevant, but the user never saw it.
It appears that users click on the (relatively) most promising resources in the top l,
independent of their absolute relevance. However, if we assume that a user scans the
list of results from top to bottom, the relative relevance is evident: all non-clicked-on
resources placed above a clicked-on resource are less relevant than the clicked-on
resource. Obviously, the relevance is related to some features that are contained in the
clicked-on resource and not contained in the non-clicked-on resources. It means that by
analysing the commonalities in the attributes of the results a user clicked or did not
click, we can infer more information about the intention of the user in the current
query session. In order to achieve this, we define the relation preferred ranking as

$$R_j <_{r_Q^*} R_i \quad \text{for all pairs } 1 \le j < i, \text{ with } i \in C \text{ and } j \notin C,$$

where $(R_1, R_2, R_3, \ldots)$ is a ranked list of resources, the set C contains the ranks of the
clicked-on resources and Q is the posted query.

ImplicitRelevance. By analysing the difference between the features (attributes,
constraints) of the clicked-on and non-clicked-on resources we get a set of so-called
Preferred constraints for a query Q in the following manner:

$$\mathrm{Preferred}(Q) = \{ con \mid con \in \bigcup_j \mathrm{Preferred}_j(Q) \}, \text{ where}$$

$$\mathrm{Preferred}_j(Q) = \{ el \mid el \in Attr(R_j) \setminus \bigcup_i Attr(R_i),\ \forall i:\ R_i <_{r_Q^*} R_j \},$$

and $Attr(R_x)$ is the set of constraints (attributes) that are defined for the resource $R_x$.
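
A minimal Python sketch of this click analysis follows. It implements the verbal description: for every clicked-on resource, the attributes it has that none of the skipped (non-clicked-on) resources ranked above it have are collected as preferred, and the attributes that only the skipped resources have are collected as non-preferred. The function names, the 0-based ranks and the sample result list are assumptions of the sketch; a clicked resource with no skipped resource above it yields no preference pair and is ignored here.

```python
def skipped_above(ranked, clicked, position):
    """Attribute sets of non-clicked resources ranked above the given position."""
    return [ranked[i] for i in range(position) if i not in clicked]


def preferred_constraints(ranked, clicked):
    """Attributes of clicked resources that no skipped resource above them has."""
    preferred = set()
    for j in sorted(clicked):
        skipped = skipped_above(ranked, clicked, j)
        if skipped:
            preferred |= ranked[j] - set().union(*skipped)
    return preferred


def non_preferred_constraints(ranked, clicked):
    """Attributes of skipped resources that the clicked resource below lacks."""
    non_preferred = set()
    for j in sorted(clicked):
        skipped = skipped_above(ranked, clicked, j)
        if skipped:
            non_preferred |= set().union(*skipped) - ranked[j]
    return non_preferred


# Hypothetical ranked result list (rank 0 is the top result).
ranked_results = [
    {"Cabriolet", "Standard"},
    {"Cabriolet", "Metallic", "Automatic"},
    {"Cabriolet", "Metallic"},
]
clicks = {1}  # the user skipped the first result and clicked the second

print(sorted(preferred_constraints(ranked_results, clicks)))
print(sorted(non_preferred_constraints(ranked_results, clicks)))
```

In this example the shared attribute Cabriolet carries no preference information, whereas Metallic and Automatic distinguish the clicked resource from the skipped one.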

Therefore, the set of constraints that seem to be relevant for a user in a query session
Qs can be calculated as:

$$\mathrm{ImplRel}(c, Qs) = \begin{cases} \sum_{i=1}^{n} \dfrac{1}{\mathrm{num\_sessions}(c, Qs, i)}, & c \in \mathrm{Preferred}(Qs) \\ 0, & \neg\, c \in \mathrm{Preferred}(Qs) \end{cases} \tag{3}$$

where num_sessions(c, Qs, i) is the number of refinement steps in which the constraint c is
involved as a preferred constraint, whereas n is the total number of times the
constraint c is treated as a preferred constraint in the current session Qs.

In this way we decrease the likelihood that a preferred constraint is still relevant if
it was suggested as relevant in a previous refinement step, but was not selected by the
user as relevant in the subsequent refinement step (see (3)). Therefore, our approach
has a self-improving nature: it learns from its failures.

ImplicitIrrelevance. Since the recommended refinements are presented to a user in
decreasing order of relevance, one can assume that if a user has selected the n-th
ranked result, then the first n - 1 ranked results (constraints) were wrongly ranked at the
top of the list of refinements. We call these constraints implicitly irrelevant. They are
calculated in a similar manner as implicitly relevant constraints:

$$\mathrm{ImplIrrel}(c, Qs) = \begin{cases} \sum_{i=1}^{n} \dfrac{1}{\mathrm{num\_sessions}(c, Qs, i)}, & c \in \mathrm{NonPreferred}(Qs) \\ 0, & \neg\, c \in \mathrm{NonPreferred}(Qs) \end{cases} \tag{4}$$

where $\mathrm{NonPreferred}(Q) = \{ con \mid con \in \bigcup_j \mathrm{NonPreferred}_j(Q) \}$ and

$$\mathrm{NonPreferred}_j(Q) = \{ el \mid el \in \bigcup_i Attr(R_i) \setminus Attr(R_j),\ \forall i:\ R_i <_{r_Q^*} R_j \}.$$

The definition of num_sessions(c, Qs, i) is analogous to (3), but regarding implicit
irrelevance.

Similar to (3), formula (4) enables the correction of false assumptions (regarding
the preferences of the current user) made in the ranking process.

Formulas (3) and (4) ensure the self-adaptivity of the ranking system, e.g. they do
not allow the system to repeatedly rank a non-interesting refinement highly. Finally,
the calculated implicit relevance of a refinement c is:

$$\mathrm{Relevance}(c, Qs) = \frac{\mathrm{ImplRel}(c, Qs) + 1}{\mathrm{ImplIrrel}(c, Qs) + 1} \tag{5}$$






2.3 Phase 3: Query Refinement 

This is the last phase in the query refinement process, which processes previously 
defined ambiguity parameters in order to help a user in finding more relevant re- 
sources in a shorter period of time. Primarily, we provide three types of support: 

1. Recommending query refinements that can lead to better fulfilment of the user’s 
need; 

2. Recommending resources which can be treated as alternatives for the resource a 
user selected for further processing; 

3. Recommending modifications of the query in case of zero results. 

2.3.1 Ranking of Recommendations (Refinements) 

The presented approach supports step-by-step query refinement, which means that a
user is provided with all and only relevant refinements. In other words, the set of
provided refinements is (1) minimal and (2) complete regarding the relevant
resources. Due to the lack of space we give the proof only for the first statement.

Proof for (1): Let us assume that for a user’s query $\Delta_{start}(q_u)$ (see Definition 10), there
is a refinement $\Delta_x$ that belongs to the set of children $C := \{ \Delta_i \mid \Delta_i <_{dir} \Delta_{start}(q_u) \}$, but
that could be omitted from the children without disabling the retrieval of some relevant
resources. It means that there are some refinements from the set C, let us assume
$\Delta_a$ and $\Delta_b$, that retrieve parts of $R_x$ (the results of the refinement $\Delta_x$), let us say $R_{xa}$ and
$R_{xb}$, respectively. Therefore, $R_a \supseteq R_{xa}$, $R_b \supseteq R_{xb}$ and $R_{xa} \cup R_{xb} = R_x$. According to the
formula for the calculation of the set DirectChildren (below Definition 7) this is not
possible, since each refinement from C has to have a unique element in its list of
results. Hence the starting assumption is false.

In the step-by-step query refinement we assume that every decision taken by
the user can be used as information about the user’s intentions and background. For
example, each transition (refinement) from the query e to the query d involves a
conscious choice by the user, who prefers the constraints contained in d to the other
options available in e. The probability of a constraint d being the target of the
navigation process is related to the distance fluctuations while travelling the navigation
path. Recent fluctuations have more impact than past fluctuations. Typically, the
farther removed a constraint is from the path, the less likely it is to be the target of the
search. This is captured by the probability of d being the search target, after having
traversed the search path p. For this purpose we transform the search space into a
transition network, allowing the use of Markov chain theory. First, the set of states is
defined as the set of query constraints augmented with a special state called stop. This
state represents the termination of the search process. The transitions between states
are defined as follows: if constraints x and y are connected regarding the ontology
structure, then they are connected in the transition network.

We assume for each transition e a probability $q_e > 0$ of occurring. In a transition
network, for each state a:

$$\sum_{b:\, a \to b} q_{a \to b} = 1.$$

The transition matrix T is defined as

$$T(x,y) = \begin{cases} q_s & \text{if } x \in \mathrm{Related}(y) \\ 0 & \text{otherwise} \end{cases}$$

where $q_s$ is uniformly distributed over all related constraints and Related(c) is a
function that retrieves the constraints that are related to c regarding the underlying
ontology. For example, Car and Motorbike are related to each other through the
superconcept Vehicle.

Further, $T^i(x,y)$ is the probability of reaching y starting from x in i transitions, with
$T^0 = I$, the identity matrix. Thus, the sum $\sum_{i=0}^{\infty} T^i(e,d)$ is the probability of reaching d
from e in any number of steps. Next we focus on the probability $\Pr(d \mid e)$ of a constraint d
being the search target of a navigation path. The destination probability for d after
traversing a path from e is defined as follows:

$$\Pr(d \mid e) = \sum_{i=0}^{\infty} T^i(e,d) \cdot T(d, stop).$$

The infinite sum converges to $(I-T)^{-1}(e,d)$ (see [11]). This requires the calculation
of the inverse of an $|\Omega(O)| \times |\Omega(O)|$ matrix, where $\Omega(O)$ is the set of elementary
constraints regarding the given ontology O. Note that this operation can be calculated
off-line.

However, in a step-by-step refinement a user is navigating through queries $Q_i$ that
contain several constraints. In that case we expand the probability to include this
information as follows:



where d and e are constraints from the queries $Q_d$ and $Q_e$ respectively. $|Q_d|$ denotes the
number of constraints in $Q_d$.

We use this destination probability function as a starting point for computing
which neighbours bring the user most directly towards the most probable destination
constraint. Indeed, as we mentioned above, the refinement process depends on the
searching history (parameters Actuality and Relevance) and the content of the
repository (parameter Interpretability).
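
The destination probability for single constraints can be illustrated with a small pure-Python sketch. It is a sketch under stated assumptions, not the computation used in the system: every constraint moves to the stop state with a fixed probability `stop_prob` (a hypothetical parameter, since the text does not fix it), the remaining probability is spread uniformly over the related constraints, and the infinite sum is approximated by iterating until the probability mass left in the network is negligible instead of inverting the matrix. The relatedness data are invented for the example.

```python
STOP = "stop"


def transition_matrix(related, stop_prob=0.2):
    """Build T as nested dicts; related maps a constraint to its related ones."""
    states = sorted(related)
    T = {x: {y: 0.0 for y in states + [STOP]} for x in states}
    for x in states:
        neighbours = related[x]
        if neighbours:
            T[x][STOP] = stop_prob
            for y in neighbours:
                T[x][y] = (1.0 - stop_prob) / len(neighbours)
        else:
            T[x][STOP] = 1.0  # an isolated constraint can only stop
    return states, T


def destination_probabilities(states, T, start, tol=1e-12, max_steps=10_000):
    """Pr(d | start) = sum_i T^i(start, d) * T(d, stop), summed iteratively."""
    current = {s: 0.0 for s in states}  # row `start` of T^i
    current[start] = 1.0                # T^0 is the identity matrix
    visits = {s: 0.0 for s in states}   # running sum of T^i(start, d)
    for _ in range(max_steps):
        if sum(current.values()) < tol:
            break
        for s in states:
            visits[s] += current[s]
        current = {y: sum(current[x] * T[x][y] for x in states) for y in states}
    return {d: visits[d] * T[d][STOP] for d in states}


# Hypothetical relatedness between elementary constraints.
RELATED = {
    "Metallic": {"BlueColour", "Standard"},
    "BlueColour": {"Metallic"},
    "Standard": {"Metallic"},
    "Cabriolet": set(),
}

states, T = transition_matrix(RELATED)
probs = destination_probabilities(states, T, start="BlueColour")
for constraint in states:
    print(f"{constraint:10s} {probs[constraint]:.3f}")
```

Because every walk eventually reaches the stop state, the destination probabilities over all constraints add up to one, which gives a simple sanity check for such an implementation.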

This is formalized by assigning the coefficient Rank to each query Q d that belongs 
to the lattice of the refinement for the query Q : 



Rank( Q d 



a, -|i Zm 'S w 






\Qe\ 



<^Qe 



A- Relevancefd )■ Actuality(e)+ Interpretability(e,Q e ) 

g> ' ( J+\ ' ’ 



where $\lambda$ is a forgetfulness coefficient that models the impact of the user’s past
behaviour on the ranking process: $\lambda = 0$ means that the past is forgotten, $\lambda < 1$ means
that the past carries less weight than the present; usually $\lambda = 1/2$.

Therefore our approach prioritises highly relevant refinements, i.e. the refinements 
that are related to the very characteristic (regarding a query) constraints and tailored 
to the user’s need. 



2.3.2 Finding Alternative Results 

An alternative result for a query is a result that does not fulfil all constraints from the 
query perfectly (but fulfils some of them), but has (many) suitable features from a 
user’s point of view (e.g. usability). In the domain of e-commerce such features are 
often called merchant characteristics (like price of a product, warranty, time of deliv- 
ery). However the problem is how to find which characteristics can be relaxed in the 
user’s query and which not. Our approach supports this decision by analysing implic- 
itly discovered user’s need. The query neighbourhood is used as a snapshot of the 
repository where relevant alternatives can be found. Briefly, we define the function 
Alternatives{R p , M, level), which finds all resources from the neighbourhood of the 
cluster in which the user selected the resource R p , for which the merchant characteris- 
tic M is in the offset level of the value for the resource R p . The system checks all re- 
sources contained in sibling clusters which satisfy the offset of the selected merchant 
characteristic. A user should set offset according to his preferences. The default value 
is 0,1. Formally, 

Altematives(Rp,M , level ) = { Rpi \ R p , c R p )-M( R p j)\< level} , 

VA,-, A,~A p 

where A_p is the current cluster, which contains the product R_p selected by the user. The
alternative results are ranked according to their relevance to the user’s currently
elicited need. Moreover, using the parameter Importance, our approach “knows” which
attributes are relevant to the user’s need and first tries to find alternative products
that retain these attributes.
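
The selection of alternatives described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary-based resource representation, the function name `alternatives` and the use of an absolute difference on values normalised to [0, 1] are assumptions made only for illustration.

```python
def alternatives(selected, merchant_key, level, sibling_clusters):
    """Return resources from sibling clusters whose merchant characteristic
    lies within `level` of the value for the selected resource.

    Assumption: resources are plain dicts and the merchant characteristic
    is numeric and already normalised, so that an offset of 0.1 is meaningful.
    """
    base = selected[merchant_key]
    found = []
    for cluster in sibling_clusters:
        for resource in cluster:
            if resource is selected:
                continue  # the selected resource is not its own alternative
            if abs(resource[merchant_key] - base) < level:
                found.append(resource)
    return found


# Hypothetical data: the user selected a product with normalised price 0.50.
selected = {"name": "p0", "price": 0.50}
siblings = [
    [{"name": "p1", "price": 0.55}, {"name": "p2", "price": 0.70}],
    [{"name": "p3", "price": 0.45}],
]
names = [r["name"] for r in alternatives(selected, "price", 0.1, siblings)]
```

Ranking the returned resources by their relevance to the user's need would be a separate step that is not shown here.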



2.3.3 Resolving Failing Queries 

A query is said to fail whenever its evaluation produces the empty set. An empty
answer is surprising to the user, since the user expects that answers to the
query exist. So when a query fails, a system could be more cooperative by helping to
trace the reason for the query’s failure, or at least by pinpointing the failure. However,
the main problem is that a query can have a huge number of (independent) causes
of failure. Consequently, finding all of them can require exponential time
in the worst case. Therefore, an efficient “repairing” system has to select only the most
relevant repairs of a failing query. In our approach the relevance of a repair is
calculated according to the loss of information that is caused by replacing/eliminating a
query constraint. In that way we ensure that the interpretation of the new (satisfiable)
query is close to the user’s information need expressed in the original query.

The degradation of a user’s information need caused by eliminating a constraint C_i
from his initial (failing) request q_x is proportional to the ambiguity (see Section 2.1.1)
introduced by this constraint and inversely proportional to the number of the query’s
constraints that are related to C_i:



$$\mathit{InfContent}(C_i, q_x) = \sum_{r \in Var(C_i)} \mathit{Ambiguity}(r, Q_x) \cdot \frac{1}{|\{ f \mid f \in Q_x \wedge f \in Related(C_i) \}|},$$



where q_x is a user’s request in the form {Q_x, R_x}, Var(c) is the function that returns the
variable contained in the constraint c, and a, b, d are weighting factors which are by
default set to 1. Related(c) is a function that retrieves the constraints that are related to c
with regard to the underlying ontology. In particular, we treat constraints defined on the
same domain as related to each other, e.g. the constraints Metallic and BlueColour are
related since both of them are defined on the concept Colour (see Fig. 1).

The following procedure searches for the most suitable repairs of a user’s
query:

Let us assume that a query q has n constraints.

Step 1. Create n subqueries, each obtained by eliminating one constraint from the
query. Evaluate these subqueries.

If at least one of these subqueries does not fail, Then go to Step 2.

Else calculate the InfContent of each constraint and select the subquery that
corresponds to (i.e. misses) the least informative constraint. Repeat Step 1 starting
with that subquery.

Step 2.

If more than one subquery did not fail,

Then calculate the InfContent of the missing constraint in each subquery and select
the subquery that corresponds to the least informative one (it is closest to
the original query), e.g. q_j.

Calculate the starting cluster A_start(q_j) and determine all direct_child clusters of that
cluster.

If one of these direct_child clusters (e.g. A_k) contains an attribute (constraint) that
is different from those of A_start(q_j) but is related to the constraint eliminated in
generating q_j, then that cluster is a candidate for the query repair. Formally,

$$\mathit{Candidates}(q, q_j) = \{\, x \mid x \in Q_a \wedge x \notin q \wedge \exists R_a : (Q_a, R_a) \in DirectChildren(A_{start}(q_j)) \wedge Related(x, y) \wedge y \in q \setminus q_j \,\}.$$

These candidates are ranked according to the ontology-based similarity (function
Related) between a candidate and the eliminated constraint.

Example: Let us suppose that a user is interested in a “Cabriolet” car that has an
“Automatic” gear changing system and a “DarkBlue” colour. However, in the
underlying product catalog (Table 1) there is no car which fulfils these constraints.
According to the previous algorithm: q = ?“Cabriolet” + “DarkBlue” + “Automatic”, q_j =
“Cabriolet” + “Automatic” and Candidates(q, q_j) = {“WhiteBlue”}, since “WhiteBlue”
and the eliminated constraint, “DarkBlue”, are related (i.e. have a common
parent). Therefore, the new query is ?“Cabriolet” + “WhiteBlue” + “Automatic”. It
conveys the user’s initial need very well.
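
The elimination part of the procedure (Step 1 and the selection at the start of Step 2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the catalog, the `evaluate` callback and the fixed `inf_content` scores are hypothetical stand-ins for the paper's Table 1 and InfContent function, and the ontology-based search for replacement candidates is omitted.

```python
def repair(query, evaluate, inf_content):
    """Greedily drop the least informative constraint until a subquery succeeds.

    query:       iterable of constraints
    evaluate:    callable(frozenset) -> list of results (empty list = failure)
    inf_content: callable(constraint, frozenset) -> float
    Returns (non-failing subquery, list of dropped constraints).
    """
    current = frozenset(query)
    dropped = []
    while current:
        # Step 1: build one subquery per eliminated constraint.
        subqueries = {c: current - {c} for c in sorted(current)}
        succeeding = [c for c, sq in subqueries.items() if evaluate(sq)]
        # Step 2 (selection): among succeeding subqueries, or among all of
        # them if every subquery fails, drop the least informative constraint.
        pool = succeeding or list(subqueries)
        victim = min(pool, key=lambda c: inf_content(c, current))
        dropped.append(victim)
        current = subqueries[victim]
        if succeeding:
            break
    return current, dropped


# Hypothetical catalog: each car is a set of feature constraints.
catalog = [
    {"Cabriolet", "Automatic", "WhiteBlue"},
    {"Limousine", "Automatic", "DarkBlue"},
    {"Cabriolet", "Manual", "DarkBlue"},
]
scores = {"Cabriolet": 3.0, "Automatic": 2.0, "DarkBlue": 1.0}  # assumed values

subquery, removed = repair(
    {"Cabriolet", "DarkBlue", "Automatic"},
    evaluate=lambda q: [car for car in catalog if q <= car],
    inf_content=lambda c, q: scores[c],
)
```

With these assumed scores the sketch drops “DarkBlue” and keeps “Cabriolet” + “Automatic”, which mirrors the subquery used in the example; each pass over the loop performs one evaluation per remaining constraint.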

Since our approach finds only a few of the most relevant repairs, its complexity
is low. We perform a depth-first search, such that in at most n steps (n is the
number of constraints) we find a non-failing subquery. In the i-th step from the
beginning we perform (n - i) tests in order to find the most suitable subquery. Therefore the
worst-case time complexity of Step 1 is O(n^2). The complexity of Step 2 is equal to
the complexity of finding the direct_child clusters of a query, which is discussed in the
next section.

3 Related Work 

1. Ontology-Based Recommender Systems. Our system can be considered as a 
content-based recommender system. From that point of view we see three main ad- 
vantages of applying our approach: (i) basically, our system does not provide just a 
ranked list of results, but it recommends the set of possible refinements of the initial 
user’s request as well, (ii) by using an ontology for indexing an information reposi- 
tory, our system can discover some implicitly presented similarities between informa- 
tion resources and therefore can provide better, semantic-based recommendations and 
(iii) by analysing the users’ implicit relevance feedback, our system learns on-line 
(short-term) users’ profiles, avoiding capturing and processing large log-files. Due to 
the ontology-based backbone of our system, it can be seen as a prototype of a
recommender system for the Semantic Web, i.e. as an example of how traditional on-line
shopping portals can benefit from moving to the Semantic Web. The benefits regarding
the integration between various product catalogs, e.g. using a common ontology,
are not discussed here. A similar, ontology-based, recommender is the Quickstep 
system [12], which recommends on-line research papers to the academic community. 
However, it is primarily focused on using an ontology for profile bootstrapping in 
order to overcome the cold start problem by generating recommendations for a novel 
user. 

2. Cooperative Answering / Personal Assistant / Interface Agents. From an end
user’s point of view, our approach can be seen as a personal assistant [13] that analyses
the user’s behaviour and interacts with him in order to provide more relevant
solutions for his initial task. Our approach differs from existing approaches to
cooperative answering in two respects: (i) the cooperation is based not only on
processing the query a user posts, but rather on understanding what the user is
searching for (i.e. a query is treated as an approximation of a user’s need) and (ii) the
cooperation helps not only in resolving failure situations (no result for a query), but also
in suggesting better alternatives for the given results to the user.

3. Concept Lattice. Conceptually, the most similar approach to our query refinement
system is Query By Navigation (QBN) [14], an approach for navigating through a
hyperindex of query terms. The hyperindex search engine supports users in adding, deleting
or substituting a term of the initial query by providing the minimal query
refinements/enlargements. However, the variety of analyses, especially regarding the
ambiguity of the query and the query equivalence, is missing. Visualization similar to
ours can be found in the REFINER system [15], which combines Boolean information
retrieval and content-based navigation with concept lattices. For a Boolean query
REFINER builds and displays a portion of the concept lattice associated with the
documents being searched, centred around the user’s query. The cluster network
displayed by the system shows the result of the query along with a set of minimal query
refinements/enlargements. However, like QBN, it does not use an ontology as
background, which limits its possibilities for proposing refinements. Moreover, it does
not tailor refinements to a user’s need.

4 Conclusion 

In this paper we presented a comprehensive approach for the refinement of ontology-
based queries, which takes into account problems reported in traditional information
retrieval. It extends our previous work on ontology-based query refinement by
involving more user-related information in the query refinement process. In that way
our approach provides a more cooperative retrieval process. First, in each refinement step a
user is provided with a complete but minimal set of refinements, which enables him to
develop/express his information need in a step-by-step fashion. Secondly, although all
users’ interactions are anonymous, we tailor the querying process to a user’s
preferences by using the user’s implicit feedback. Next, it is possible to anticipate which
alternative resources may be interesting for the user. Finally, this principle enables coping
with a user’s requests that cannot be fulfilled in the given repository (i.e. requests
that return zero results), a problem that is hard to solve for existing information retrieval
approaches.

Applied to Web search, our approach enables a new paradigm for presenting results
to users in the Semantic Web: the retrieved list of results is meta-organized by
clustering the results into groups which are ranked according to their relevance to
the user’s need.

Abstract. The explosive growth of spatial data worldwide coupled with
the emergence of GRID computing provides a strong motivation for de- 
signing a spatial GRID which allows transparent access to geographically 
distributed data. While different types of queries may be issued from any 
node in such a spatial GRID for retrieving the data stored at other (re- 
mote) nodes in the GRID, this paper specifically addresses spatial join 
queries. Incidentally, skewed user access patterns may cause a dispropor- 
tionately large number of spatial join queries to be directed to a few ‘hot’ 
nodes, thereby resulting in severe load imbalance and consequently in- 
creased user response times. This paper focusses on load-balanced spatial 
join processing in a spatial GRID. 



1 Introduction 

The explosive growth of spatial data worldwide coupled with the prevalence of 
spatial applications has made efficient management of geographically distributed 
spatial data a necessity. Spatial applications often arise in town planning, car- 
tography, resource management, GIS (Geographic Information Systems), CAD 
(Computer-Aided Design) and computer vision. Incidentally, the emergence of 
GRID computing [4], which is associated with the massive integration and vir- 
tualization of geographically distributed computing resources, provides a strong 
motivation for designing a spatial GRID [11] which allows transparent access 
to geographically distributed data. While different types of queries (e.g., spatial 
select queries 1 , nearest neighbour queries, similarity search queries and spatial 
join queries) may be issued from any node in the GRID for retrieving the data 
stored at other (remote) nodes of the GRID, this paper specifically addresses 
spatial joins on remote data since such queries constitute a typically expensive 
as well as popular class of query in spatial databases. Incidentally, a spatial join 
query retrieves from two spatial relations all the tuple pairs satisfying a given 
spatial predicate. 

Now let us understand the importance of optimizing remote spatial joins with
the help of an example. This year’s Olympic Games is expected to attract tens of
thousands of visitors to Athens and many of these visitors would possibly wish
to find a hotel near a bus station for the purpose of convenient transportation.
Such visitors may issue the following query from their respective home countries,
which may be quite far away from Athens: Find all the hotels in Athens which are
near to any bus station. Assuming there are two relations ‘Hotels’ (containing
details, such as location and rental charges, of all the hotels in Athens) and ‘Bus
Stations’ (containing information concerning all the bus stations in Athens), this
translates to a remote spatial join operation. Interestingly, the above scenario
is equally applicable to any major international event that attracts people
from various countries. Notably, the recent trend of increased globalization has
significantly increased the importance as well as the performance demands of
such global applications. Unfortunately, the current state of the art does not
allow a user to perform this kind of operation efficiently.

1 Our previous work [11] studied load-balancing of spatial select queries in a spatial
GRID.

Skews in initial data distributions, skewed user access patterns and chang- 
ing popularities of data regions may cause a disproportionately large number 
of spatial join queries to be directed to a few ‘hot’ nodes, thereby resulting in 
severe load imbalance and consequently increased user response times. From our 
example, given the huge number of queries from potential visitors to Athens, the 
nodes containing the data of Athens would quickly become overloaded, thereby 
necessitating a load-balancing mechanism for processing remote spatial joins ef- 
ficiently. Several factors such as wide-area communication overheads, node het- 
erogeneity and lack of centralization make the problem of load-balanced spatial 
join processing in GRIDs significantly more complex than that of load-balanced 
spatial join processing in traditional distributed environments such as clusters. 
However, we believe that the time has come to deal head-on with this problem. 
The main contributions of our proposal are as follows. 

— We present a dynamic data placement strategy involving online data repli- 
cation in GRIDs, the objective being to bring the data closer to the node 
from which it is frequently queried. 

— We propose a novel load-balancing strategy for speeding up spatial joins in 
GRID environments. 

Our performance evaluation demonstrates the effectiveness of our proposed ap- 
proach in reducing the response times of spatial joins in GRIDs. To our knowl- 
edge, this work is one of the earliest attempts at addressing the load-balancing 
of remote spatial joins via online data replication in GRID environments. The 
remainder of this paper is organized as follows. Section 2 discusses related work, 
while Section 3 presents the system overview. Issues concerning load-balancing 
in spatial GRIDs are presented in Section 4. The proposed strategy for load- 
balancing spatial joins in GRIDs is discussed in Section 5, while Section 6 reports 
our performance evaluation. Finally, we conclude in Section 7. 



2 Related Work 

Important ongoing GRID computing projects such as the European DataGrid
[2], the Grid Physics Network (GriPhyN) [13] and the Earth Systems Grid (ESG)
[6] aim at efficient distributed handling of huge amounts of data (in the terabyte or
petabyte range).

Issues concerning spatial databases with GIS applications can be found in
[15], while a comprehensive survey on spatial indexes has been presented in [5].
Parallel spatial join processing has been extensively researched in the traditional
domain. The proposals in [7] and [1] discuss synchronous traversal of
generalization trees and R*-trees respectively. The work in [1] investigates parallel
load-balanced spatial join processing using R*-trees on a shared-virtual-memory
architecture. The PBSM (Partition Based Spatial-Merge) algorithm [12] first
partitions the inputs into smaller chunks and uses a plane-sweeping technique based
on computational geometry to obtain a set of candidate pairs; the
tuples corresponding to the candidate set are then fetched from disk to determine
whether the join condition is actually satisfied. The work in [10] proposes a
parallel non-blocking spatial join algorithm which uses duplicate avoidance and
addresses main memory issues. Several dynamic load-balancing techniques [16]
have also been proposed specifically for clusters. Notably, neither the
existing spatial join techniques nor the existing load-balancing strategies consider
GRID-related issues such as heterogeneity and wide-area communication
overheads, essentially because these issues do not arise in traditional
environments.

Incidentally, our proposal amounts to some form of caching of the results. 
Recent works on caching include [9,14]. The work in [9] analyzes the effects of 
different design choices involving cache structure, cache capacity, and timeouts 
for caching previously discovered routes in demand routing protocols for wireless 
ad hoc networks. In [14], a semantic caching scheme has been proposed for 
accessing location-dependent data in mobile environments. 

3 System Overview 

We envisage the spatial GRID as comprising several clusters, where each cluster 
comprises nodes that belong to the same Local Area Network (LAN) [11]. This 
facilitates the separation of concerns between intra-cluster and inter-cluster load- 
balancing issues. Given that intra-cluster issues have been extensively researched, 
this paper specifically focusses on inter-cluster issues. For each cluster, the most 
reliable and best administered node is selected as the cluster leader, any ties 
being resolved arbitrarily. A cluster leader’s job is to coordinate the activities 
(e.g., load-balancing, searching) of the nodes in its cluster. We define distance 
between two clusters as the communication time r between the cluster leaders 
and if the value of r for two clusters is less than a pre-defined threshold, the 
clusters are regarded as neighbours. A cluster C_i is considered to be relevant to
a query Q if C_i contains at least a non-empty subset of the answers to Q. Given
that the number of queries waiting in node N_i’s job queue is W_i, and taking the
heterogeneity in node processing capacities into account, we define L_{N_i}, the load
of N_i, as follows:

$$L_{N_i} = W_i \times (CPU_{N_i} \div CPU_{Total}) \qquad (1)$$

where CPU_{N_i} denotes the CPU power of N_i and CPU_{Total} stands for the total
CPU power of the cluster in which N_i is located. The load of a cluster is
calculated as \sum L_{N_i}, i.e., the sum of the loads of its individual members. Given that
the loads of two clusters C_i and C_j are L_{C_i} and L_{C_j} respectively and assuming
without loss of generality that L_{C_i} > L_{C_j}, the normalized load difference Δ
between C_i and C_j is computed as follows:

$$\Delta = ((L_{C_i} \times T_{C_i}) - (L_{C_j} \times T_{C_j})) / (T_{C_i} + T_{C_j}) \qquad (2)$$

where T_{C_i} is the sum of the CPU power of all the members of C_i. Similarly,
T_{C_j} is the sum of the CPU power of all the members of C_j.
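
Equations (1) and (2) can be traced with a short Python sketch. The function names and the representation of a cluster as a list of (waiting queries, CPU power) pairs are assumptions introduced here for illustration; the arithmetic follows the two formulas as stated.

```python
def node_load(waiting_queries, cpu_node, cpu_total):
    """Equation (1): load of a node, weighted by its share of cluster CPU power."""
    return waiting_queries * (cpu_node / cpu_total)


def cluster_load(nodes):
    """Sum of the node loads; `nodes` is a list of (waiting_queries, cpu_power)."""
    cpu_total = sum(cpu for _, cpu in nodes)
    return sum(node_load(w, cpu, cpu_total) for w, cpu in nodes)


def normalized_load_difference(load_i, cpu_i, load_j, cpu_j):
    """Equation (2): normalized load difference between clusters C_i and C_j."""
    return (load_i * cpu_i - load_j * cpu_j) / (cpu_i + cpu_j)


# Hypothetical clusters: C_i has two nodes of CPU power 2.0, C_j two of 1.0.
cluster_i = [(10, 2.0), (4, 2.0)]
cluster_j = [(2, 1.0), (2, 1.0)]
load_i = cluster_load(cluster_i)   # 10 * 0.5 + 4 * 0.5
load_j = cluster_load(cluster_j)   # 2 * 0.5 + 2 * 0.5
delta = normalized_load_difference(load_i, 4.0, load_j, 2.0)
```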

Spatial indexing mechanisms may vary across clusters. Hence, we propose 
a generalized indexing scheme which is built on top of the existing index at a 
cluster node. The indexing scheme in each cluster comprises three index struc- 
tures, namely IID (Index structure for Internal Data), IED (Index structure for 
External Data) and IRD (Index structure for Replicated Data). Note that we 
distinguish between a cluster’s internal data (the data which is originally stored 
at a cluster) and its replicated data because it may not be possible to integrate 
the replicated data smoothly into the existing index structure (for the internal 
data) of the cluster as the replicated data may be far apart in space from the 
cluster’s internal data. Such separation of concerns between internal data and 
replicated data also makes it easier to periodically delete infrequently accessed 
replicas for optimizing disk space usage. IID is a generalized two-tier indexing
mechanism, the first tier of which resides at the cluster leader C_i and is
essentially a list, each entry of which is of the form (region, node-id), where region
represents a specific region and node-id stands for the node in the cluster at
which the region is located. At the second tier, every node has its own
independent index structure for the data allocated to it. For example, in the case of R-trees
[8], rectangular regions would be stored in the first tier at C_i, while the
second tier would comprise R-trees at the individual nodes. IED and IRD are
hierarchical tree-based index structures, which reside at the cluster leader. In
our example, IED/IRD would be R-tree-like structures except that their leaf
nodes would contain cluster IDs of neighbouring clusters’ data regions instead
of pointers to objects in the database. Updates to IED/IRD are periodically
exchanged between neighbouring cluster leaders, preferably via piggybacking onto
other messages. Notably, in this paper, we have used the R-tree as an example,
but our proposed technique may also be applicable to other spatial indexing
structures, albeit with certain modifications. Hence, in the near future, we also
plan to investigate the use of other spatial indexing structures for performing
spatial joins in GRIDs.

When a cluster leader C_i receives a query Q, it first checks its IID and IRD to
ascertain whether any of its cluster nodes is relevant to Q. If C_i finds that none
of its nodes is relevant to Q, it checks its IED and sends Q to its neighbouring
cluster leaders which are relevant to Q. In case none of its neighbouring cluster
leaders contains the answers to Q, C_i broadcasts Q to all of them. This process
continues until either the answers to Q are retrieved or Q times out. Assume
the existence of n clusters in the GRID, C_1, C_2, C_3, ..., C_n. Now suppose cluster
C_i issues a spatial join query Q_i for the data in cluster C_j. A straightforward
solution would be for C_j to process Q_i and return the results to C_i, but this
solution would not be efficient in scenarios where many spatial join queries, which
attempt to retrieve data from the same regions in C_j, are issued from C_i to C_j.
Our primary focus is to reduce the response times of such spatial join queries on
remote data via replication.

4 Issues Concerning Load-Balancing of Spatial Joins via Replication in GRIDs

This section discusses important issues which need to be addressed when
supporting load-balancing of spatial joins via replication in GRIDs.



Hotspot Detection 

The heat of data regions should be defined with respect to the clusters which issue
queries for these regions. For example, if cluster C_s issues a large number of
queries pertaining to data region D_i of cluster C_t, but cluster C_o issues no
queries for D_i, D_i will be considered a ‘hot’ region only w.r.t. C_s (but not
w.r.t. C_o). Understandably, many different clusters accessing D_i infrequently
would make D_i a ‘hot’ region in the conventional sense, but replicating D_i at
any of these clusters may not necessarily be useful for reducing response times
and may indeed be counter-productive owing to replication-related overheads.
For hotspot detection purposes, every node within a cluster maintains its own
access statistics comprising a list HotList, each entry of which is of the form
(data, ptr_access). Here data represents a specific data region and ptr_access is a
pointer to an array of triples of the form (cluster_id, num, avgtime), where
cluster_id stands for the ID of the particular cluster which accessed data, num
indicates the number of times that cluster accessed data and avgtime represents
the average processing time it took for a node to perform a spatial join on data.
The value of avgtime is used in ascertaining the benefit of replicating a specific
data region, as we shall see in Section 5. This information is periodically sent by
every node to its cluster leader, thereby enabling the cluster leader to determine
the ‘popularity’ of different data regions with respect to different clusters.

For reflecting current hotspots accurately in dynamic GRID environments,
we propose that HotList should be re-initialized periodically, i.e., the information
in HotList should be deleted periodically and HotList should then be populated
with fresh access information. Additionally, whenever an overloaded source cluster
node N_i has completed offloading some part of its load to a node at another
cluster, N_i refreshes its own HotList by deleting those entries which
triggered the replication, so that HotList reflects only those hotspots for which
action has not yet been taken. Moreover, we maintain access statistics
at the granularity of the leaf node levels of the index structures at the
nodes. Throughout this paper, we shall use the term data region to indicate
the spatial region corresponding to the Minimum Bounding Rectangle (MBR)
of the data stored at a leaf node of the IID at a particular node. An interesting
question which arises here is: Given that several different kinds of queries can be
issued to a real system, how do we know whether the leaf node accesses are being
made specifically for a spatial join query? Incidentally, at the leaf node level, it
is not feasible to determine the type of query for which the leaf node is being
accessed, but this information can be found at the query engine level.
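
The per-node access statistics described above can be sketched as a small Python class. This is an illustrative assumption about the layout rather than the authors' data structure: the class and method names are hypothetical, and the (cluster_id, num, avgtime) triples are kept in a dictionary keyed by data region and cluster ID.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AccessStats:
    num: int = 0          # number of accesses by one cluster
    avgtime: float = 0.0  # running average of the spatial join processing time


class HotList:
    def __init__(self):
        # data region -> {cluster_id -> AccessStats}
        self._stats = defaultdict(dict)

    def record(self, region, cluster_id, processing_time):
        """Register one spatial join access and update the running average."""
        s = self._stats[region].setdefault(cluster_id, AccessStats())
        s.avgtime = (s.avgtime * s.num + processing_time) / (s.num + 1)
        s.num += 1

    def hot_regions(self, cluster_id, threshold):
        """Regions that `cluster_id` accessed more than `threshold` times."""
        return [region for region, per_cluster in self._stats.items()
                if cluster_id in per_cluster
                and per_cluster[cluster_id].num > threshold]

    def stats(self, region, cluster_id):
        return self._stats[region][cluster_id]

    def reset(self):
        """Periodic re-initialization: forget all access information."""
        self._stats.clear()


hot = HotList()
for t in (2.0, 4.0, 6.0):
    hot.record("D1", "Cs", t)
```

Keeping the counts per issuing cluster reflects the point made in the text that a region is hot only with respect to the clusters that actually query it.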



What to Replicate? 

Once a ‘popular’ region D_i w.r.t. a specific cluster C_i has been detected, our
strategy is to replicate the results of the spatial join operation on D_i at C_i. Note
that replicating the results (as opposed to replicating the data itself) can
significantly benefit those subsequent spatial join queries whose spatial select windows
have considerable overlap with D_i, since C_i will not need to do any processing at
all for a significant part of these queries. Additionally, replicating the results can
reduce communication cost significantly if the join selectivity of D_i is low. Even in
the case of high join selectivity of D_i, intuitively we can understand that the
communication cost of replicating the result tuples can never exceed that of replicating
the data itself. Our strategy assumes that the datasets are relatively static. Note
that we use ‘data replication’ and ‘replication of result tuples’ interchangeably
throughout this paper to imply replication of result tuples.



Exploiting Overlap Between Different Spatial Join Queries 

Interestingly, every spatial join query has an MBR associated with it
either explicitly or implicitly. We shall designate this MBR as SPJMBR (Spatial 
Join’s Minimum Bounding Rectangle). For example, the join query “Find a 
hotel near a station in Athens within a 5 km radius of X, where X is a certain 
landmark in Athens” explicitly specifies the SPJMBR associated with the join 
query, while the query “ Find a hotel near a bus station in Athens ” implicitly 
specifies that the SPJMBR for this query corresponds to the MBR of Athens. 
Intuitively, efficient exploitation of overlaps between different spatial join queries 
requires a mechanism for storing the replicas in a manner which enables quick 
identification of overlap between spatial join queries and existing replicas. 

In our proposed system, whenever a replica of result tuples is stored at a
cluster, the cluster leader also stores the SPJMBR corresponding to that replica.
Identification of overlap between an SPJMBR and a spatial join query Q can
be classified into three cases. (a) Q’s MBR does not intersect with the SPJMBR: This
implies that there is no overlap between the query and the existing replica.
(b) Q’s MBR is fully contained within the SPJMBR: This means that all the
result tuples requested by Q are already in the stored replica and only a spatial
select query using Q’s MBR as the spatial select condition needs to be run on
the replicated data to obtain the answers to Q. We propose to run this spatial
select query on the replicated data at the cluster C_Rep where the replicated
data exists (if C_Rep is not overloaded and has sufficient disk space) to save
communication costs. However, if C_Rep is overloaded and/or has insufficient
available disk space, the replica is sent to the cluster C_Issue which issued Q
and C_Issue needs to run a spatial select query with Q’s MBR on the replica
to obtain the query results. (c) Q’s MBR partially intersects with the SPJMBR:
The implication is that the results of Q already exist for the intersecting part
between Q’s MBR and the SPJMBR, but for the non-intersecting parts, the results
need to be computed. In this case, the tuples in the intersecting part are sent
to the query-issuing cluster, while a spatial join operation needs to be run to get
the result tuples in the non-intersecting parts between Q’s MBR and the SPJMBR.
This spatial join operation involving the non-intersecting parts of Q’s MBR and
the SPJMBR should be run at C_Rep if C_Rep’s load is low; otherwise it should be
executed at C_Issue.
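
The three overlap cases can be distinguished with a few rectangle comparisons, as the following Python sketch shows. The `MBR` type, the function names and the treatment of rectangles as closed (touching edges count as intersecting) are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class MBR(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersection(a: MBR, b: MBR) -> Optional[MBR]:
    """Common rectangle of a and b, or None if they do not intersect."""
    xmin, ymin = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xmax, ymax = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    if xmin > xmax or ymin > ymax:
        return None
    return MBR(xmin, ymin, xmax, ymax)


def classify_overlap(query: MBR, spjmbr: MBR) -> str:
    """Return 'disjoint' (case a), 'contained' (case b) or 'partial' (case c)."""
    common = intersection(query, spjmbr)
    if common is None:
        return "disjoint"    # no replica tuples can be reused
    if common == query:
        return "contained"   # a spatial select on the replica suffices
    return "partial"         # reuse the common part, join the remainder


replica = MBR(0, 0, 10, 10)
```

For case (c), `intersection(query, replica)` gives the part of the query window whose result tuples can be taken from the replica; the remaining area still requires a spatial join.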

Whenever a spatial join query Q arrives at a cluster leader, the cluster leader 
traverses its list of SPJMBRs and identifies and exploits overlaps (if any) in the 
manner stated above. Additionally, in order to optimize disk space usage, each 
cluster leader keeps track of the replicas in its cluster nodes as well as the number 
of accesses made to each of the replicas during recent time intervals. Replicas 
whose access frequency during recent time intervals falls below a pre-defined 
threshold are deleted because the valuable disk space consumed by such unused 
replicas can be put to better use by storing ‘hot’ data, thereby improving system 
performance. 



5 Load-Balancing Strategy for Spatial Joins in GRIDs 

In our proposed strategy, a cluster leader determines itself to be overloaded if
its load exceeds the average load of its neighbouring clusters by more than y%.
(The value of y is application-dependent and in our case, we assume y = 15.)
When a cluster leader determines itself to be overloaded, it periodically checks
the frequency with which its data regions 2 are being queried by other cluster
leaders during the recent time intervals. Based on this information, the cluster
leader C_i creates a set ψ comprising all cluster leaders which have issued more
than η queries for any of its regions. (η is a threshold parameter which influences
the sensitivity of load-balancing.) C_i sends a message to each member of ψ,
informing it of C_i’s disk space requirement (i.e., the amount of disk space required
to store the replicated data) and requesting information concerning its load
status, its list of neighbours and whether its available disk space is sufficient to
store the replicated data. After receiving the necessary status information from ψ’s
members, C_i evaluates their replies one by one. Members of ψ whose available
disk space is too low to store the replicated data or whose normalized load
difference with C_i (Δ) falls below a pre-specified threshold are deleted from ψ.
For such members, C_i adds their lists of neighbouring clusters to ψ. Among these
neighbouring clusters, those with low disk space or with a low normalized
load difference with C_i are likewise deleted from ψ. The remaining members of ψ are
candidates for replication. For each member α of ψ, C_i traverses each hot data
region H that has been queried by α and decides whether to replicate the spatial
join result tuples associated with H on a case-by-case basis. Now let us see how
C_i makes this decision.
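
The overload test and the pruning of candidate clusters can be sketched in Python as below. The sketch rests on assumptions: "exceeds by more than y%" is read as a relative excess over the neighbours' average load, candidates are plain dictionaries with hypothetical keys `free_disk` and `delta`, and the step of adding the neighbours of rejected members is left out.

```python
def is_overloaded(own_load, neighbour_loads, y_percent=15.0):
    """True if own_load exceeds the neighbours' average load by more than y%."""
    average = sum(neighbour_loads) / len(neighbour_loads)
    return own_load > average * (1 + y_percent / 100)


def prune_candidates(candidates, required_disk, min_delta):
    """Keep clusters with enough disk space and a sufficient load difference.

    Each candidate is a dict with the (assumed) keys 'id', 'free_disk'
    and 'delta', where 'delta' is the normalized load difference.
    """
    return [c for c in candidates
            if c["free_disk"] >= required_disk and c["delta"] >= min_delta]


candidates = [
    {"id": "C2", "free_disk": 500, "delta": 4.0},
    {"id": "C3", "free_disk": 50, "delta": 6.0},   # too little disk space
    {"id": "C4", "free_disk": 800, "delta": 0.5},  # load difference too small
]
kept = [c["id"] for c in prune_candidates(candidates, required_disk=100, min_delta=1.0)]
```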

2 Recall that ‘data region’ refers to the spatial region corresponding to the MBR of
the data stored at a leaf node of the IID at a particular node.

The total cost C_H of replicating H from C_i at α consists of the cost Extr_H
of extracting H from the IID at C_i, the communication cost Com_H of transferring H
and the bulkloading cost Bulk_H of integrating H into the IRD of α. Hence, C_H
is given by the following formula:

$$C_H = Extr_H + Com_H + Bulk_H \qquad (3)$$

Recall that every cluster leader maintains information concerning the average 
processing time and the accesses made to each data region from each of the 
clusters in the system. Let $n_H$ denote the number of times H has been accessed 
by a, and let $\mathit{avgtime}_H$ represent the average processing time of H. Hence, the 
benefit $B_H$ of replicating H at a can be estimated as follows: 

$B_H = n_H \times \mathit{avgtime}_H$ (4) 

From (3) and (4), we have the following formula: 

$\mathit{Decide}_H = (B_H - C_H > TH_{min})$ (5) 

where $TH_{min}$ is a pre-defined, essentially application-dependent threshold 
parameter on which the degree of load-balancing depends, and $\mathit{Decide}_H$ 
is a boolean variable. Every member a of ip for which $\mathit{Decide}_H$ evaluates to ‘TRUE’ 
is put into a temporary list data structure which we shall designate as ‘temp’. 
For each ‘hot’ data region H, ‘temp’ stores the corresponding destination 
candidates (those members of ip for which $\mathit{Decide}_H$ evaluated to ‘TRUE’) 
in a linked list. Using ‘temp’, the overloaded cluster leader invokes a function 
Select_dest_from_temp( ), which selects, for each H, the least loaded member of 
the corresponding list in ‘temp’ as the destination cluster. The load-balancing 
algorithm executed by an overloaded source cluster leader is depicted 
in Figure 1, while the load-balancing algorithm executed by a potential 
destination cluster leader is presented in Figure 2. 
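
The decision rule of Equations (3) to (5) and the destination selection step can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the `RegionStats` record, the helper `build_temp`, the dictionary layout and all numeric values are assumptions made for the example, and costs and benefits are assumed to be expressed in the same time unit as the threshold.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Figures kept for one (hot region H, candidate cluster) pair.
    All field names are hypothetical."""
    extract_cost: float   # cost of extracting H at the source
    comm_cost: float      # cost of transferring H
    bulkload_cost: float  # cost of bulkloading H at the destination
    accesses: int         # number of times the candidate accessed H
    avg_time: float       # average processing time of H


def replication_cost(s: RegionStats) -> float:
    # Equation (3): sum of the three cost components
    return s.extract_cost + s.comm_cost + s.bulkload_cost


def replication_benefit(s: RegionStats) -> float:
    # Equation (4): access count times average processing time
    return s.accesses * s.avg_time


def decide(s: RegionStats, th_min: float) -> bool:
    # Equation (5): replicate only if the net benefit exceeds the threshold
    return replication_benefit(s) - replication_cost(s) > th_min


def build_temp(stats, th_min):
    """Map each hot region to the candidates for which decide() is true."""
    temp = {}
    for (region, cluster), s in stats.items():
        if decide(s, th_min):
            temp.setdefault(region, []).append(cluster)
    return temp


def select_dest_from_temp(temp, load):
    """For each hot region, pick the least loaded candidate in 'temp'."""
    return {region: min(cands, key=lambda c: load[c])
            for region, cands in temp.items() if cands}
```

Keeping the per-region candidate lists separate from the final selection mirrors the two-stage description in the text: the threshold test filters candidates, and only then is current load used to choose among them.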

Observe that in contrast with existing works in traditional environments, our 
strategy does not use the value of normalized load difference when deciding upon 
the amount of data to replicate. This is because in our scenario, the increase in 
load at a owing to spatial join queries on H is negligible (even in case of spatial 
select conditions) as compared to the decrease in load for Ci especially since 
the join has already been computed. Moreover, note that replication is initiated 
from Ci to a whenever the normalized load difference between Ci and a exceeds 
a given threshold, irrespective of whether Ci is really overloaded or not. Even 
if Ci is not overloaded, we believe it is still reasonable to replicate at a since 
bringing the data closer to the cluster from where the data are being frequently 
queried implies a reduction in network overheads (as well as response times) for 
future spatial joins on the same data. 

6 Performance Study 

This section reports the performance evaluation of our proposed inter-cluster 
load-balancing technique via replication of result tuples of spatial join queries. 
Algorithm LB_OverloadedSource( ) 

Create a set ip comprising cluster leaders that issued more than η queries for any of its regions 
if (ip is an empty set) { 
    exit 
} else { 
    for each element a in set ip { 
        Send message to a asking for a’s disk space, load and neighbours’ list 
        Receive reply from a 
        if ( ( a’s disk space is NOT sufficient ) OR ( A < LOAD_THRESHOLD ) ) { 
            /* A is the normalized load difference between itself and a */ 
            Delete a from set ip and add members of ListNeighbours to ip 
            for each member NG of ListNeighbours { 
                Send message to NG asking for NG’s disk space availability and current load 
                Receive reply from NG 
                if ( ( NG’s disk space is NOT sufficient ) OR ( A < LOAD_THRESHOLD ) ) { 
                    Delete NG from ip 
                } 
            } 
        } 
    } 
    for each element a in ip { 
        for each data region H queried by a { 
            if ( Decide_H == TRUE ) { 
                Put a into a temporary list designated as ‘temp’ 
            } 
        } 
    } 
    Select_dest_from_temp( ) 
} 
end 

Fig. 1. Load-balancing Algorithm executed by an overloaded source cluster leader 



Note that we consider performance issues associated only with inter-cluster 
load-balancing, since a significant body of research work pertaining to efficient 
intra-cluster load-balancing algorithms already exists. Hence, for our experiments, we 
use a cluster size of 1. The machine used for the experiments had a processing 
capacity of 1.7 GHz (Pentium 4), main memory of 768 Mbytes and disk space 
of 40 GB. We ran the experiments under the Red Hat Linux (version 7.3) 
operating system using LAM-MPI (version 7.00) for message-passing. In order to 
model inter-cluster communication in a wide area network environment, we 
assigned transfer rates for communication between cluster leaders randomly in the 
range of 0.8 Megabit/second to 1.2 Megabit/second. We used a maximum of 3 
neighbouring cluster leaders corresponding to each cluster leader. The number 
of clusters simulated in our experiments was 24. The interarrival time between 
queries arriving at a cluster was fixed at 10 milliseconds and the value of $TH_{min}$ 
was set to 5 seconds. We have used two real-life datasets [3] for our experiments. 
The first dataset is the set of roads in Germany, while the second one is a dataset 
of railway lines in Germany. The first dataset comprises MBRs of 30,674 streets 
of Germany, while the second one consists of MBRs of 36,334 railroad lines in 
Germany. We had enlarged each of these datasets by translating and mapping 
the data for the purpose of our experiments. For our experiments, each of the 
clusters had more than 200,000 rectangles for each of the relations. We used two 
R-trees at each cluster, one for each dataset. We assumed that one R-tree node 
fits in a disk page (page size = 4096 bytes). Hence, R-tree node capacity is the 
same as page size in our case. The height of each of the R-trees was 3 and the 
fan-out was 64. We generated queries for each cluster by using a spatial select 
(window query) condition in conjunction with the spatial join. Note that this is 
in consonance with real-world scenarios where spatial joins may quite often be 
accompanied by certain select conditions. The selectivity of each spatial join 
query was fixed at 40%. Assuming n queries for a particular cluster Ci, let us 
designate the queries as Q1, Q2, ..., Qn. We generated the n queries for Ci such 
that the queries had at least 75% overlap with each other. This overlap was 
generated by shifting the respective spatial select query windows in such a manner 
that each query had x% (where x ≥ 75%) overlap with the other queries. 

Algorithm LB_PotentialDestination( ) 

Receive message from overloaded source cluster leader SRC 
/* The message contains the disk space requirement of SRC */ 
Send a Broadcast message to all the nodes in its cluster asking each node for its current 
    load and disk space 
Receive replies concerning current load and disk space of each node 
Nodes with sufficient available disk space and load below a pre-defined threshold A are 
    put into a set Candidate 
if ( Candidate is an empty set ) { 
    Send message to SRC stating that its disk space is insufficient and informing SRC 
        about its list of neighbours 
} else { 
    Send message to SRC informing SRC about its sufficient disk space, its current load 
        and its list of neighbours 
} 
Receive reply from SRC 
if ( SRC has selected it as the destination cluster ) { 
    Send a Broadcast message to the nodes in Candidate for their current load status 
    Receive the corresponding replies and select the least loaded node MIN from Candidate 
    Send a message to SRC to replicate the data at MIN 
} 
end 

Fig. 2. Load-balancing algorithm executed by each potential destination cluster leader 
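
One way to produce query windows with the required mutual overlap is to shift a base window in small equal steps. The Python sketch below illustrates this under stated assumptions: windows are axis-aligned rectangles of identical size, overlap is measured as intersection area divided by window area, and shifting happens along the x-axis only. The function names and the shifting scheme are hypothetical; the paper does not specify how the windows were actually shifted.

```python
def overlap_fraction(a, b):
    """Fraction of rectangle a's area covered by rectangle b.
    Rectangles are (xmin, ymin, xmax, ymax) tuples."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    return (width * height) / ((a[2] - a[0]) * (a[3] - a[1]))


def shifted_windows(base, n, min_overlap=0.75):
    """Generate n windows by shifting `base` along x so that even the
    first and last windows overlap by about `min_overlap`."""
    xmin, ymin, xmax, ymax = base
    # The total shift budget is the fraction of the width that may be lost.
    step = (1.0 - min_overlap) * (xmax - xmin) / max(n - 1, 1)
    return [(xmin + i * step, ymin, xmax + i * step, ymax) for i in range(n)]
```

Because all windows have the same size and are shifted in one direction only, the pair with the smallest overlap is always the first and the last window, so bounding that pair bounds every other pair as well.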

For performing the spatial join operation at each cluster, we use an existing 
approach where the data from the smaller fragment is extracted and used to 
probe the index structure corresponding to the larger fragment. For the sake of 
convenience, we shall refer to our proposed technique as LBREP (Load-balancing 
via replication). Since no work on load-balanced processing of remote spatial 
joins in GRIDs exists, we shall compare the performance of LBREP with a 
technique which performs spatial join without load-balancing. We designate this 
reference technique as NOLB (No load-balancing). For all our experiments, we 
ran the system for an initial period of time to obtain access statistics 
information and, once the system had reached a stable state (after the replication of 
result tuples had been performed), we noted down the results. We only present 
results associated with the stable state of the system. 

Fig. 3. Replication table and QD1 and QD2 for a 24-cluster GRID 



The replications that have already been performed (based on access statistics 
information) prior to the system reaching stable state are depicted in Figure 3(a). 
In Figure 3(a), C_Source represents the ID of the source cluster whose data 
(spatial join result tuples) have been replicated, while C_Destination stands for 
the IDs of the destination clusters where C_Source’s data has been replicated. For 
example, the first row of the table indicates that a portion of cluster 1’s data 
has been replicated at clusters 24 and 15. Similarly, a part of cluster 2’s data has 
been replicated at clusters 23 and 17, and so on. Note that the portions of cluster 
1’s data that have been replicated at clusters 24 and 15 need not necessarily be 
the same, even though overlap is possible between the replicated data of cluster 
1 at cluster 24 and cluster 15. This is because the replication performed was 
based on previous access statistics, thereby implying that the replicated data at 
different clusters depends upon the queries that these clusters had issued during 
the past. Now we shall evaluate the relative performance of LBREP and NOLB 
by using different query distributions. Even though we had used several query 
distributions to test the robustness of LBREP, in the interest of space, here we 
present only two such distributions. For the sake of convenience, we shall refer 
to these query distributions as QD1 and QD2 respectively. Figures 3b and 3c 
summarize QD1 and QD2. In Figure 3b, N denotes the number of queries, C_e 
indicates the ID of the cluster which processed the queries, C_q represents the 
IDs of the clusters which issued those queries and f stands for the number of 
queries issued by a cluster. Note that the sequence of the queries arriving at each 
cluster is also specified by Figure 3b. For example, the first row of the table in 
Figure 3b indicates that 16 queries (let us designate them as Q1 to Q16) were 
processed by cluster 1. Q1 to Q9 were issued by cluster 24, while Q10 to Q16 were 
issued by cluster 15. In contrast, the first row in Figure 3c indicates that Q1 to 
Q10 were issued by cluster 15, while cluster 24 issued Q11 to Q16. Owing to 
space constraints, we are not able to present the detailed results concerning all 
the queries in the system. Note that the selectivity of each spatial join query in 
case of both QD1 and QD2 was fixed at 40%. In all our experiments, cluster 1 is 
the most overloaded (hot) cluster and it was also the last cluster in the GRID to 
complete processing. Hence, we shall examine details concerning the processing 
of queries that were directed to cluster 1. 

Fig. 4. Results on QD1 for a 24-cluster GRID 

Figures 4 and 5 depict the results corresponding to QD1 and QD2 
respectively. Figure 4a indicates the average response times of all the queries directed 
to each cluster. The results demonstrate that LBREP is indeed able to decrease 
the average response times for each of the clusters significantly, in particular 
decreasing the average response time of cluster 1 by up to 48%. The reduction in 
average response times occurs because of the reduction in disk I/O overhead at 
the query executing clusters as well as the reduction in communication 
overhead arising from transmission of result tuples to the clusters which issued the 
respective queries. To put things into perspective, we take a closer look at the 
processing of the 16 queries that were directed to cluster 1. Figure 4b depicts the 
individual response times of each of the 16 queries that were directed to cluster 
1 for QD1, while Figure 4c shows the corresponding disk I/Os incurred for each 
query at cluster 1 for the same experiment. Figure 4d indicates the number of 
KBytes for each query that cluster 1 had to transmit to the cluster which had 
issued the query. 

Observe that Figure 4b indicates that for all the queries directed to cluster 1, 
LBREP’s performance is superior to that of NOLB in terms of response times. 
Such reductions occur because part of the results of the spatial join have already 
been replicated at the clusters which issued these queries (clusters 24 and 15 in this 
case). The implication is that cluster 1 did not need to process a significant part 
of each of these queries, thereby resulting in a reduction of the disk I/O cost incurred 
by cluster 1. Moreover, since clusters 24 and 15 already had a part of the results 
associated with the queries that they issued, the number of result tuples that 
cluster 1 had to transmit to such clusters was also reduced, thereby reducing 
the communication overhead. Detailed investigation of the experimental results 
revealed that the reduction in disk I/O cost varied between 45% and 54%, while 
the reduction in the total size of result tuples transmitted to the querying clusters 
varied between 46% and 52%. However, note that the price LBREP pays for 
improvements in response time is additional disk space usage, since replication 
causes redundant usage of disk space. We believe that the overhead of additional 
disk space usage is justifiable because of the significant improvement in response 
times of spatial joins that LBREP provides. 

Fig. 5. Results on QD2 for a 24-cluster GRID 

The explanations for Figure 4 also hold good for the results in Figure 5. 
Observe that the performance of NOLB remains the same in Figures 4 and 5 
because in the case of NOLB, no data has been replicated at the querying clusters, 
thereby implying that every query is completely processed at the query executing 
cluster and then the results are sent back to the querying clusters. We also find 
that the results in Figures 4 and 5 differ to some extent for LBREP. This is 
because the portions of cluster 1’s data replicated at clusters 24 and 15 were not 
exactly the same, even though there was overlap between those portions. 

7 Conclusion 

Huge amounts of available spatial data worldwide and the prevalence of spatial 
applications, coupled with the emergence of GRID computing, provide a strong 
motivation for designing a spatial GRID. Skewed user access patterns may cause 
severe load imbalance in the system, thereby degrading system performance 
significantly. Our proposal has specifically focussed on speeding up remote spatial 
joins in this environment via a novel dynamic load-balancing strategy which 
deploys online replication. In the near future, we plan to address issues concerning 
dynamic data. For dynamic data, query results may change, thereby 
requiring updates to be propagated to the clusters containing the old replicated 
result tuples. Moreover, we shall investigate scalability issues concerning larger 
numbers of clusters. Additionally, we plan to examine the use of other spatial 
index structures for performing spatial joins in GRIDs. 

Abstract. Searching for similar objects (in terms of near and nearest neighbors) 
of a given query object from a large set is an essential task in many applica- 
tions. Recent years have seen great progress towards efficient algorithms for this 
task. This paper takes a query language perspective, equipping SQL with the near 
and nearest search capability by adding a user-defined-predicate, called NN-UDP. 
The predicate indicates, among a set of objects, if an object is a near or nearest- 
neighbor of a given query object. The use of the NN-UDP makes the queries 
involving similarity searches intuitive to express. Unfortunately, traditional cost- 
based optimization methods that deal with traditional UDPs do not work well for 
such SQL queries. Better execution plans are possible with the introduction of a 
new operator, called NN-OP, which finds the near or nearest neighbors from a set 
of objects for a given query object. An optimization algorithm proposed in this 
paper can produce these plans that take advantage of the efficient search algo- 
rithms developed in recent years. To assess the proposed optimization algorithm, 
this paper focuses on applications that deal with streaming time series. Experi- 
mental results show that the optimization strategy is effective. 



1 Introduction 

In many applications, searching for similar objects of a given query object from a large 
given set is important. Similarity measure is best intuited as some distance and simi- 
larity is then usually expressed in terms of near and nearest neighbors. When complex 
objects such as time series are involved and object sets are large, the task of finding 
near and nearest neighbors becomes rather costly. A large body of research has been 
devoted to reducing this cost and efficient algorithms and indexing structures have been 
developed (see, e.g., [13, 12, 1, 5, 11, 10]). 

To the best of our knowledge, however, there has not been any systematic study on 
how to incorporate near and nearest neighbor searches into the popular query language 
SQL and, more important, how to optimize the resulting queries. The purpose of this 
paper is to initiate such a study, considering the two aspects of the query language, 
namely expression and optimization, when it needs to deal with similarity searches. 

* This is an abbreviated version of the technical report [6]. 
As an example application, consider the problem of monitoring different sources 
of time series data to detect certain events (e.g., onset of a flu season). Assume we 
have collected many patterns (in the form of time series) from historical data on school 
attendance and flu-related medicine sales at pharmacies. Using data analysis tools, we 
may have learned the strong correlation between the presence of certain events and 
the appearance of certain patterns in school and pharmacy data. For example, a sharp 
decrease in school attendance accompanied by a sharp increase in pharmacy sales three 
days in a row is a strong indication of the beginning of a flu season. Based on such 
learned “rules” and the time series data reported every day regarding current school 
attendance and pharmacy sales, we may detect the appearance of certain events. 

To be more specific, suppose we have two sets of historical patterns, S for school 
attendance and P for pharmacy sales. For simplicity, let us assume S = {s1, s2, s3} 
and P = {p1, p2, p3}, respectively. We can store the rules learned from historical data 
in a relation called Events, shown in Figure 1. Each row in the relation represents a 
rule learned from the historical data. For example, the first row says that when school 
attendance data matches pattern s1 and pharmacy data matches pattern p1, it usually 
indicates the peak of a flu season. 
Fig. 1. Events table with the rules learned from historical data 



The meaning of “match” in the above example can be in terms of near and 
nearest neighbor based on a similarity (or distance) measure. Near neighbors are defined in 
terms of the distance between a pair of objects (e.g., time series) irrespective of the 
existence of other objects, while nearest neighbor is a relative notion, defined with respect 
to a set. Both notions are best used together. Thus, saying that the current school data matches 
pattern s1 may indicate that, among all the school patterns, s1 is closest to the current 
school data (nearest neighbor notion) and, at the same time, the two are not too far away 
from each other (near neighbor notion). 

Clearly, algorithms and data structures that can provide efficient evaluation of near 
and nearest neighbors can be very helpful in the above example application. However, 
it is also very important that users have an intuitive language to express 
these types of queries. At the same time, the system should figure out how to efficiently 
answer these queries, invoking efficient algorithms and indexing structures. For this 
purpose, we propose to add a user-defined-predicate (UDP) to SQL for users to express 
the notion of near and nearest neighbors in their queries. The UDP, called NN-UDP, 
indicates, among a set of objects, if an object is a near/nearest-neighbor of a query one. 

The use of the NN-UDP makes queries involving near/nearest neighbors easy to 
express, since the user can intuitively treat it as a selection condition. However, when 
the query is evaluated, we probably do not always want to evaluate the predicate as a 
selection condition. Indeed, we would like to take advantage of the algorithms and data structures for 
finding near and nearest neighbors. For example, in our detection application, we may 
want to report the corresponding event name by performing the following three steps: 

1. In pattern set S, find the nearest neighbor (call it si) of the current school series. 
2. In pattern set P, find the nearest neighbor (call it pj) of the current pharmacy series. 
3. Issue an SQL query to select the eName of table Events with School = si and 
   Pharmacy = pj as the conditions. 

Here, we use the direct method to find the near/nearest neighbors from sets of 
objects instead of using the NN-UDP as selection conditions. Obviously, various strategies 
can apply to each of the above steps. 
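
The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: time series are equal-length lists of numbers, Euclidean distance serves as the similarity measure, nearest neighbors are found by a plain linear scan rather than by an index structure, and the Events relation is passed in as a list of (eName, School, Pharmacy) tuples. The function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, patterns):
    """Return the ID of the pattern closest to `query` (linear scan).
    `patterns` maps pattern IDs to series."""
    return min(patterns, key=lambda pid: euclidean(query, patterns[pid]))


def detect_event(school_series, pharmacy_series, S, P, events):
    si = nearest(school_series, S)      # Step 1: nearest school pattern
    pj = nearest(pharmacy_series, P)    # Step 2: nearest pharmacy pattern
    # Step 3: SELECT eName FROM Events WHERE School = si AND Pharmacy = pj
    return [name for (name, school, pharmacy) in events
            if school == si and pharmacy == pj]
```

In a real system the linear scans in Steps 1 and 2 would be replaced by one of the indexed search algorithms cited in the introduction; the sketch only fixes the order in which the searches and the relational selection are combined.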

However, the above strategy may not always be the best. For example, once the 
school pattern si is found from Step 1, the number of tuples in the Events relation that 
satisfy the condition School = si may be so small that, in fact, a near/nearest 
neighborhood test of (in contrast to search for) the corresponding patterns from the Pharmacy 
column will be more beneficial. 

In order to make the correct decision on selecting the appropriate evaluation strat- 
egy, we introduce a heuristic optimization method which is based on a new operator 
NN-OP and the derived algebraic equivalence rules involving NN-UDP and NN-OP. 
Our experiments have confirmed the superiority of this systematic cost-based approach. 

By treating each NN-UDP in the query either as a traditional UDP or as the output of 
NN-OP, our proposed optimization method can find better execution plans than those 
appearing in previous work on UDP query optimization [4, 9, 3, 2]. In [4], Chimenti 
et al. propose an algorithm for the LDL system to optimize queries with UDPs. In 
LDL, each UDP is treated as a relation during the query optimization process. However, 
since it uniformly treats each UDP as a relation, it may fail to consider some efficient 
plans. Hellerstein and Stonebraker propose the Predicate Migration algorithm [9, 8], which 
improves the LDL approach by pushing down selections on both operands of a join. 
Later, Chaudhuri and Shim present several efficient algorithms that are able to guarantee 
the optimal plan over the desired execution space and show that these algorithms 
can either find the optimal plan or efficiently find a plan that is very close to the optimal 
one [3]. However, all these algorithms uniformly treat UDPs as selection conditions. 

Our work is also related to [2] by Chaudhuri and Gravano, which deals with query 
optimization when external searches are involved. They study the optimization of 
queries over multimedia repositories, and assume that query predicates are indepen- 
dent. However, in this paper, the similarity searches on different sources are less likely 
to be independent since they are involved in the detection of the same event. 

The contribution of this paper can thus be summarized as follows. Firstly, we take a 
query language perspective to deal with similarity searches. The novelty of our language 
is on the incorporation of the nearest neighbor searches. This makes similarity-based 
queries easy to write and provides a powerful tool for various related tasks. Secondly, 
we provide a heuristic optimization algorithm to derive efficient evaluation plans for 
these queries, fully taking advantage of the efficient algorithms and index structures for 
near/nearest neighbor for evaluation. Thirdly, we use experiments to demonstrate the 
effectiveness of our optimization algorithm for the (streaming) time series case. 

The remainder of the paper is organized as follows. In Section 2, we define our 
extension of SQL, called SQL/sim, to incorporate the similarity search capability into 
SQL. In Section 3, we discuss our optimization algorithm, including algebraic equiva- 
lence rules, and our heuristic method for deriving optimized evaluation plans. We report 
experimental results in the (streaming) time series case in Section 4, and in Section 5, 
we conclude our paper with some future research directions. 

2 SQL/sim 

In this section, we provide a simple extension of SQL, called SQL/sim, that offers the 
capability of expressing similarity searches in RDBMSs. We start with defining the 
similarity between pairs of objects. 

Definition Given two objects p and q, the similarity measure, denoted $\mathit{sim}(p, q)$, is a 
non-negative real number. 

A similarity metric might have a positive or an inverse relationship with 
the similarity of two objects. For example, the correlation coefficient is positive, i.e., the 
larger the metric value, the more similar the objects are. On the other hand, Euclidean 
distance is inversely related to the similarity of objects, i.e., the smaller the metric 
value, the more similar the objects are. Without loss of generality, in this paper, we 
assume that if two objects are more similar, then the similarity metric is smaller. In this 
case, the similarity measure is more like a distance measure. 

Definition Let $\alpha$ be a non-negative real number. Given a query object q, an object p is 
said to be its $\alpha$-near neighbor if $\mathit{sim}(p, q) \le \alpha$. 

In the above definition, the number $\alpha$ is called the nearness threshold. This 
near-neighbor definition relates similar objects independently of the existence of other objects. 
In some situations, it is meaningful to obtain similarity between the query object and 
an object relative to a set of objects. We call this relative measure the k-nearest 
neighbors, defined as follows: 

Definition Let $k \ge 1$ be an integer, $P = \{p_1, p_2, \ldots, p_m\}$ a set of objects, and q a query 
object. An object $p_i$ in P is said to be one of the k-nearest neighbors of q in P if there 
are at most $k - 1$ objects $p_j$ ($j \ne i$, $1 \le j \le m$) such that $\mathit{sim}(p_j, q) < \mathit{sim}(p_i, q)$. 

Integer k above is called the rank of similarity¹. For k = 1, the 1-nearest neighbor 
of q is normally abbreviated as the nearest neighbor of q. 
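
The two definitions translate directly into predicates. The Python sketch below is an illustration only: the similarity measure is passed in as a function `sim`, the near-neighbor test is assumed to be non-strict (in line with the "no greater than" wording used for the threshold later in the paper), and the function names are hypothetical.

```python
def is_near(sim, p, q, alpha):
    """True if p is an alpha-near neighbor of q."""
    return sim(p, q) <= alpha


def is_k_nearest(sim, p, q, objects, k):
    """True if p is one of the k-nearest neighbors of q in `objects`,
    i.e. at most k - 1 objects are strictly closer to q than p is."""
    d = sim(p, q)
    closer = sum(1 for other in objects if sim(other, q) < d)
    return closer <= k - 1
```

Note that `is_near` depends only on the pair (p, q), whereas `is_k_nearest` needs the whole object set, which is exactly the absolute versus relative distinction drawn in the text.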

2.1 Required Relations 

In order to model the notions of objects and object sets for the purpose of near/nearest- 
neighbor searches in a RDBMS, an application needs to set up at least two relations. 
One corresponds to the set of all (pattern) objects (abstractly called Patterns), and the 
other to the collection of (pattern) object sets (called PatternSets). Each pattern set must 
be a subset of the Patterns. 

1 For simplicity, we assume no two pairs of objects will have exactly the same similarity mea- 
sure. In real applications, we remark it’s easy to lift this restriction. 
For our running example, the Patterns are all the historical patterns and the 
PatternSets are those collections of historical patterns related to specific types of events 
(e.g., all the patterns related to school attendance are collected into schoolSet). 

In terms of relation schemas, the relation for Patterns should have an ID attribute 
(PID) that is the primary key of the relation; and the relation for PatternSets should 
have two attributes that are for the ID of the sets (SID) and the ID of the Pattern (PID). 
The primary key must be these two attributes together, and the PID must reference 
the ID of the Patterns relation. These two relations only need to use the identifiers of 
the objects and the object sets, while the objects themselves may be stored elsewhere 
inside or outside of the relational database. For example, the objects themselves may be 
stored as BLOBs in an RDBMS. 

The relations corresponding to our running example are shown in Figure 2. Here, the 
two required relations are given by Patterns and PatternSets, respectively. We assume 
that all patterns in the Events relation (under the attributes School and Pharmacy) in 
Figure 1 refer to attribute PID in the Patterns relation. 



Patterns (P) 

PID    pName 
s1     Fast_Up 
s2     Up-Down 
s3     Slow_Up 
p1     Slow-Down 
p2     Quick-Change 
p3     Fast-Down 

PatternSets (PS) 

SID            PID 
schoolSet      s1 
schoolSet      s2 
schoolSet      s3 
pharmacySet    p1 
pharmacySet    p2 
pharmacySet    p3 

Fig. 2. For our running example, the required Patterns and PatternSets relations 
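
The key constraints described for the two required relations can be written down concretely. The sketch below uses Python's standard `sqlite3` module purely for illustration; the choice of SQLite, the TEXT column types and the in-memory database are assumptions, and the pattern objects themselves (e.g. BLOBs) are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Patterns (
    PID   TEXT PRIMARY KEY,          -- pattern identifier
    pName TEXT
);
CREATE TABLE PatternSets (
    SID TEXT NOT NULL,               -- pattern set identifier
    PID TEXT NOT NULL REFERENCES Patterns(PID),
    PRIMARY KEY (SID, PID)           -- the two attributes together
);
""")
conn.executemany(
    "INSERT INTO Patterns VALUES (?, ?)",
    [("s1", "Fast_Up"), ("s2", "Up-Down"), ("s3", "Slow_Up"),
     ("p1", "Slow-Down"), ("p2", "Quick-Change"), ("p3", "Fast-Down")],
)
conn.executemany(
    "INSERT INTO PatternSets VALUES (?, ?)",
    [("schoolSet", "s1"), ("schoolSet", "s2"), ("schoolSet", "s3"),
     ("pharmacySet", "p1"), ("pharmacySet", "p2"), ("pharmacySet", "p3")],
)
conn.commit()
```

With the foreign key in place, every pattern set is guaranteed to be a subset of Patterns, which is the requirement stated in Section 2.1.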

Query objects need not belong to relation P. They can be stored in relations or can 
be constant IDs that are understood by SQL/sim. In this paper, we will use constant IDs. 

2.2 NN-UDP 

To equip SQL with the near/nearest-neighbor search capability, we introduce a user- 
defined-predicate (UDP) as follows. 

Definition NN is a 5-ary predicate such that for each given query object QID, pattern 
set SID (from relation PS), pattern PID (from relation P and in the set SID), integer 
RK ≥ 1, and real number TH ≥ 0, 

NN(QID, SID, PID, RK, TH) = True 

if and only if PID is one of the RK-nearest neighbors of QID in the pattern set SID, and 
the similarity measure between them is no greater than TH. NN is called the NN-UDP. 

For example, NN('s', 'schoolSet', 's1', 1, 0.2) = True if and only if s1 is the nearest 
neighbor of s in schoolSet, and the similarity measure of s1 and s is no greater than 0.2. 
Unlike most other proposals, NN-UDP incorporates both near and nearest neighbor 
tests. 
In the above definition, the predicate takes five parameters. Among them, QID 
and PID are specific values, while SID, RK and TH can take NULL values. The 
semantics of these three attributes with NULL values are: 

1. If SID=NULL, the pattern set is all patterns in P. 
2. If RK=NULL, the similarity rank is infinity. 
3. If TH=NULL, the similarity threshold is infinity. 

Therefore, SID=NULL is a special case that takes all the patterns as a “global” set 
for the purpose of the nearest neighbor test. Furthermore, by allowing NULL either for 
RK or TH, we can use NN-UDP for either near or nearest neighbor tests, respectively. 
In the cases where both RK and TH are NULL, NN(QID, SID, PID, RK, TH) = True 
means that PID is in the set SID. On the other hand, if neither RK nor TH is NULL, 
then NN(QID, SID, PID, RK, TH) = True means that PID is both a near and nearest 
neighbor of QID. 
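
The NULL semantics can be made concrete with a small reference implementation of the predicate. The Python sketch below is an assumption-laden illustration rather than the paper's implementation: NULL is represented by `None`, `data` maps every ID (patterns and query objects) to its object, `patterns` is the set of PIDs in relation P, `pattern_sets` maps each SID to its set of PIDs, and the predicate is evaluated by a linear scan.

```python
import math


def nn(sim, data, patterns, pattern_sets, qid, sid, pid, rk=None, th=None):
    """Evaluate NN(QID, SID, PID, RK, TH) with None standing for NULL."""
    # SID = NULL: use all patterns in P as the "global" set.
    members = patterns if sid is None else pattern_sets[sid]
    if pid not in members:
        return False
    # RK = NULL / TH = NULL: rank and threshold are infinite.
    rank = math.inf if rk is None else rk
    threshold = math.inf if th is None else th
    d = sim(data[pid], data[qid])
    # Count the members that are strictly closer to the query object.
    closer = sum(1 for other in members if sim(data[other], data[qid]) < d)
    return closer <= rank - 1 and d <= threshold
```

When both RK and TH are `None`, both comparisons are trivially satisfied and the function reduces to the membership test described in the text.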

2.3 SQL/sim Examples 

With SQL/sim, users can write similarity-based queries in an intuitive manner. Here we 
give two example queries. The first corresponds to the flu detection scenario mentioned 
in the introduction. 

Example 1. Report the event name based on the observed school time series s and 
pharmacy time series p. More specifically, the event name is decided by: (1) the near- 
est neighbor of school attendance time series in schoolSet with similarity measure no 
greater than 0.2; (2) the nearest neighbor of pharmacy time series in pharmacySet; and 
(3) the rules in the Events table, as described in Figure 2. The query in SQL/sim is: 

SELECT E.eName 
FROM Events E 
WHERE NN(‘s’, ‘schoolSet’, E.School, 1, 0.2) 
AND NN(‘p’, ‘pharmacySet’, E.Pharmacy, 1, NULL) 

The result is a list of event names. 

Example 2. As another example, we want to find the PIDs of the nearest neighbor of 
the school attendance time series s in schoolSet if this nearest neighbor satisfies the 
following two conditions: (1) the similarity measure is no greater than 0.2; and (2) ac- 
cording to the rules in table Events , this nearest neighbor pattern is correlated to the 
event named Flu_peak, i.e., this nearest neighbor appears in a row of Events table with 
eName=‘Flu_peak’. The query is as follows, and the result is a list of school IDs. 

SELECT E.School 
FROM Events E 
WHERE E.eName = ‘Flu_peak’ 
AND NN(‘s’, ‘schoolSet’, E.School, 1, 0.2) 



3 Optimizing SQL/sim 

In this section, we develop a strategy to optimize queries in SQL/sim. We start with an 
example to show various options for evaluating queries in SQL/sim. We then describe 
a heuristic method that can, in many cases, automatically take the best option when 
evaluating a query. 

Consider the query in Example 1. Using traditional optimization methods for UDPs, 
two execution plans are possible: 

(1) One may consider the order of applying the two NN-UDPs in the query. Cost and 
selectivity should both be considered for this ordering as in [3,9]. 

(2) Another method is to consider each NN-UDP as a relation as in [4]. In this case, 
both occurrences of NN will be evaluated on the entire relation and the results are 
joined together (with the Events relation again). The join order needs to be carefully 
considered as in [4]. 

Each of the above strategies has its advantages in particular situations as explained 
below. This is due to the fact that, in most situations, algorithms that test if an object 
is a near/nearest neighbor are faster than those that search for near/nearest neighbors². 
More specifically, in our example, if the number of patterns in Events under the 
School attribute is large, then it is beneficial to find the nearest neighbor (with RK=1 
and TH=0.2) of the query object s by using some indexing method. Since at most one 
school attendance pattern will be found by this process, once this is done, we can select 
the tuples in the Events relation that contain that particular school attendance pattern, 
and project out the patterns under the Pharmacy attribute in these tuples. If the number 
of the resulting patterns is small, we can then use the NN-UDP to test each one. If the 
number is still large, we can use an index-based algorithm to find the nearest neigh- 
bor of p, and join back to the relation Events to obtain the final result. (This last case, 
where index-based algorithms are used twice, corresponds to the strategy found in [4]). 
Of course, this whole process can be done starting with the second occurrence of the 
NN-UDP. 

Another possibility is that there are only a few tuples in Events. Then testing each 
one of them using the two NN predicates will probably be the best strategy. Here, the 
strategy of [3, 9] should be considered. 
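
The two extremes discussed above can be contrasted with a small sketch for the query of Example 1. It is illustrative only: the scan-based `search_nearest` and `is_nearest` helpers, the Euclidean `dist`, and the tuple layout of `events` are assumptions made for this example, and the two plans are only guaranteed to agree when no two patterns are equally near the query.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Series = Sequence[float]
Event = Tuple[str, str, str]  # (eName, School pattern ID, Pharmacy pattern ID)


def dist(a: Series, b: Series) -> float:
    # Stand-in similarity measure (assumption).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def search_nearest(q: Series, patterns: Dict[str, Series],
                   th: Optional[float]) -> Optional[str]:
    """Search: look through the whole set for the nearest pattern within th."""
    best = min(patterns, key=lambda pid: dist(q, patterns[pid]), default=None)
    if best is None or (th is not None and dist(q, patterns[best]) > th):
        return None
    return best


def is_nearest(q: Series, patterns: Dict[str, Series],
               pid: str, th: Optional[float]) -> bool:
    """Test: any() stops as soon as one strictly nearer pattern is seen."""
    if pid not in patterns:
        return False
    d = dist(q, patterns[pid])
    if th is not None and d > th:
        return False
    return not any(dist(q, other) < d for other in patterns.values())


def plan_test_all(events: List[Event], s: Series, p: Series,
                  school: Dict[str, Series],
                  pharmacy: Dict[str, Series]) -> List[str]:
    """Few tuples in Events: test both predicates on every tuple."""
    return [name for name, sid, pid in events
            if is_nearest(s, school, sid, 0.2)
            and is_nearest(p, pharmacy, pid, None)]


def plan_search_then_test(events: List[Event], s: Series, p: Series,
                          school: Dict[str, Series],
                          pharmacy: Dict[str, Series]) -> List[str]:
    """Many school patterns: search once, select matching tuples, then test."""
    best = search_nearest(s, school, 0.2)
    if best is None:
        return []
    return [name for name, sid, pid in events
            if sid == best and is_nearest(p, pharmacy, pid, None)]
```

Which plan is cheaper depends on the number of tuples in Events and on the relative cost of searching versus testing, which is exactly the trade-off the optimizer has to estimate.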

The above example shows that each of the traditional methods mentioned previously 
may be best in certain situations. However, a combination of these methods may be 
called for in certain other situations. The choice must be made by considering the cost, 
selectivity and sizes of the involved operations and intermediate results. 

3.1 Near/Nearest-Neighbor Operator 

From the above example, we can see that we cannot simply treat NN-UDPs as selection 
conditions or their output as relations. Rather, we need to choose different plans for 
different query instances. Sometimes we need to use index-based algorithms to directly 
find near/nearest-neighbors, and sometimes we may use the NN-UDP directly. 

2 For example, we used a scan method for both testing and searching in our experiments. In 
our scan method, search needs to look through the entire pattern set, while test can stop much 
earlier when a nearer object is found and hence is faster in general. If a multidimensional 
index is used for multidimensional objects, a test only asks whether there is any object 
within a range and hence is generally faster than searching for the exact objects that are 
the near/nearest neighbors. 

Fig. 3. Example of NN-OP, assuming s1 is the near/nearest neighbor of s in schoolSet (with 
RK = 1 and TH = 0.4), and p3 is the near/nearest neighbor of p in pharmacySet (with 
RK = 1 and TH = 0.5) 



In order to derive such optimized evaluation plans, we need to define an operator 
that encodes the use of an indexing algorithm for finding near/nearest-neighbors. 

Definition Let S = {QID, SID, RK, TH}. The relational operator D is defined as follows: 
For each relation R whose schema contains all the attributes in S, D(R) is the 
relation with the schema $S \cup \{PID\}$ such that a tuple t is in D(R) if and only if 
(1) $t[S]$ is in $\pi_S(R)$, and (2) NN(t[QID], t[SID], t[PID], t[RK], t[TH]) = True. 
Operator D is called the NN-OP. 

Intuitively, for each tuple t in R, D finds the near/nearest neighbors for the query object 
t[QID] among the patterns in the set t[SID] with rank and threshold t[RK] and t[TH], 
respectively. Figure 3 shows an example of the D operator. 

The above definition of D is extended to relations that have some of the attributes 
in S missing. More specifically, R may lack any of the attributes SID, RK, and TH. 
In these cases, the output of D will not have these attributes either, and condition 
(2) in the above definition will take the NULL value in place of t[SID], t[RK], or t[TH]. 

It should be noted that the output size of NN-OP may be even bigger than that of 
the input relation. For example, if RK = k, then for each tuple in R, D(R) may contain 
k tuples derived from it. 
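
As a rough illustration of the NN-OP, the following sketch evaluates D over a list of dictionaries. It is written under several assumptions that are not part of the definition: relations are modeled as lists of dicts, `queries` and `pattern_sets` are hypothetical lookup tables from IDs to series, neighbors are found by a plain sort rather than an index, and ties in distance are broken arbitrarily by the sort.

```python
from typing import Dict, List, Sequence

Series = Sequence[float]
S_ATTRS = ("QID", "SID", "RK", "TH")


def dist(a: Series, b: Series) -> float:
    # Stand-in similarity measure (assumption).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nn_op(rows: List[dict],
          queries: Dict[str, Series],
          pattern_sets: Dict[str, Dict[str, Series]]) -> List[dict]:
    """Sketch of D(R): one output tuple per (t[S], near/nearest neighbor)."""
    out, seen = [], set()
    for row in rows:
        key = {a: row[a] for a in S_ATTRS if a in row}  # t[S]; missing = NULL
        sid, rk, th = key.get("SID"), key.get("RK"), key.get("TH")
        if sid is None:
            # SID = NULL: search all patterns as one "global" set.
            patterns = {pid: ser for ps in pattern_sets.values()
                        for pid, ser in ps.items()}
        else:
            patterns = pattern_sets[sid]
        q = queries[key["QID"]]
        ranked = sorted(patterns, key=lambda pid: dist(q, patterns[pid]))
        if rk is not None:
            ranked = ranked[:rk]  # keep the RK nearest patterns
        for pid in ranked:
            if th is not None and dist(q, patterns[pid]) > th:
                continue  # outside the threshold
            t = dict(key, PID=pid)
            fingerprint = tuple(sorted(t.items()))
            if fingerprint not in seen:  # relations are sets: no duplicates
                seen.add(fingerprint)
                out.append(t)
    return out
```

For an input with a single tuple and RK = 2, the result can contain two tuples, which shows how the output of the NN-OP can be larger than its input.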

3.2 Equivalence Rules 

As explained in the beginning of this section, our goal is to use the NN-OP in place of 
some NN-UDPs in evaluating a query. In this section, we give a set of transformation 
rules for this purpose. 

Our transformation rules work on the relational algebra expression derived from 
SQL/sim. In a relational algebra expression, we naturally treat NN-UDP as 
a selection condition on a relation R containing attributes QID, SID, PID, RK, and 
TH, written as $\sigma_{NN}(R)$. If R lacks any of the attributes SID, RK, or TH, the 
corresponding value is treated as NULL. 

As an example, let $R_1$ be a relation with schema {QID, SID, RK, TH} that contains 
only one tuple (‘s’, ‘schoolSet’, 1, 0.2) and $R_2$ be a relation with schema {QID, 
SID, RK} that contains only one tuple (‘p’, ‘pharmacySet’, 1). The SQL/sim query in 
Example 1 can be written in relational algebra form as $\pi_{eName}(R''_1 \bowtie R''_2)$, where 

$R''_1 = \pi_{EID,eName}(\sigma_{NN}(R'_1))$ with $R'_1 = \rho_{School \to PID}(R_1 \times Events)$, and 

$R''_2 = \pi_{EID,eName}(\sigma_{NN}(R'_2))$ with $R'_2 = \rho_{Pharmacy \to PID}(R_2 \times Events)$. 

Note that $\rho$ is the renaming operator. 

In the following, R can be any relation with the implied attributes. For notational 
convenience, in these rules and the later plans, we will indicate the ordering of the 
operators by nested algebraic expressions. 

Figure 4 shows the equivalence rules. Rule 1 gives the equivalence transformation 
between $\sigma_{NN}$ and D. Rule 2 shows how to move a selection operator $\sigma$ inside the 
NN-OP D. Likewise, we can exchange the order of join and NN-UDP as in Rule 3. Rules 4-7 
are useful to prune some operators and the associated relations from the plan. Note that 
Rules 4 and 5 are different since in Rule 4, relation PS refers to the PatternSets relation 
which contains attribute PID, while in Rule 5, the schema of R does not contain PID. 



Rule 1: $\sigma_{NN}(R) = R \bowtie D(R)$ 

Rule 2: $\sigma_c(D(R)) = D(\sigma_c(R))$, if c only refers to attribute(s) in {QID, SID, RK, TH}. 

Rule 3: $R_1 \bowtie D(R_2) = R_1 \bowtie D(R_1 \bowtie R_2)$ 

Rule 4: $PS \bowtie D(R \bowtie PS) = D(R \bowtie PS)$ 

Rule 5: $R \bowtie D(R \bowtie PS) = D(R \bowtie PS)$, if R's schema is a subset of {QID, SID, RK, TH}. 

Rule 6: $\pi_p(D(R)) = \pi_p(R)$, if p is a set of attributes that does not contain PID. 

Rule 7: $D(\pi_p(R)) = D(R)$, if p is the subset of {QID, SID, RK, TH} appearing in R’s schema. 

Fig. 4. Equivalence rules (PS is PatternSets) 
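
Rule 1 can be checked on a toy instance. The sketch below is an illustration only: relations are modeled as lists of dicts, `queries` and `sets` are hypothetical lookup tables, the Euclidean `dist` is a stand-in similarity measure, and `nn_op` enumerates every pattern and applies the predicate directly instead of using an index.

```python
from typing import Dict, List, Optional, Sequence

Series = Sequence[float]
Row = Dict[str, object]


def dist(a: Series, b: Series) -> float:
    # Stand-in similarity measure (assumption).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nn(q: Series, patterns: Dict[str, Series], pid: str,
       rk: Optional[int], th: Optional[float]) -> bool:
    """NN-UDP as a test; ties count only strictly nearer patterns (assumption)."""
    if pid not in patterns:
        return False
    d = dist(q, patterns[pid])
    if th is not None and d > th:
        return False
    return rk is None or sum(dist(q, s) < d for s in patterns.values()) < rk


def select_nn(rows: List[Row], queries, sets) -> List[Row]:
    """Left-hand side of Rule 1: NN-UDP used as a selection condition."""
    return [r for r in rows
            if nn(queries[r["QID"]], sets[r["SID"]], r["PID"], r["RK"], r["TH"])]


def nn_op(rows: List[Row], queries, sets) -> List[Row]:
    """D(R) with schema S plus PID; the PID column of R itself is ignored."""
    out: List[Row] = []
    for r in rows:
        for pid in sets[r["SID"]]:
            if nn(queries[r["QID"]], sets[r["SID"]], pid, r["RK"], r["TH"]):
                t = {"QID": r["QID"], "SID": r["SID"],
                     "RK": r["RK"], "TH": r["TH"], "PID": pid}
                if t not in out:
                    out.append(t)
    return out


def natural_join(left: List[Row], right: List[Row]) -> List[Row]:
    """Natural join on all attributes the two tuples have in common."""
    joined = []
    for l in left:
        for r in right:
            if all(l[a] == r[a] for a in l.keys() & r.keys()):
                joined.append({**l, **r})
    return joined
```

On an instance shaped like $R'_1$ from Example 1, `select_nn(R, ...)` and `natural_join(R, nn_op(R, ...))` return the same tuples, which is what Rule 1 states.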



With the set of equivalence rules shown in Figure 4, we can transform a query plan 
involving NN-UDPs into equivalent ones with NN-OPs only or a combination of NN- 
UDPs and NN-OPs. We use the query in Example 1 to illustrate how to use these rules 
to obtain different query plans. 

At the beginning of this subsection, we represented the query as $\pi_{eName}(R''_1 \bowtie R''_2)$. 
This expression corresponds to the straightforward way of executing the query by testing 
the two NN-UDPs independently, joining the two testing results on attribute EID, 
and projecting out the attribute of interest. 

Alternatively, we can use the equivalence rules to generate another query plan: 

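$$
\begin{aligned}
&\pi_{eName}(R''_1 \bowtie R''_2) \\
&= \pi_{eName}(\pi_{EID,eName}(\sigma_{NN}(R'_1)) \bowtie \pi_{EID,eName}(\sigma_{NN}(R'_2))) \\
&= \pi_{eName}(\pi_{EID,eName}(\sigma_{NN}(R'_1)) \bowtie \sigma_{NN}(R'_2)) && \text{// standard equivalence} \\
&= \pi_{eName}(\pi_{EID,eName}(R'_1 \bowtie D(R'_1)) \bowtie \sigma_{NN}(R'_2)) && \text{// Rule 1} \\
&= \pi_{eName}(\pi_{EID,eName}(R'_1 \bowtie D(R_1)) \bowtie \sigma_{NN}(R'_2)) \\
&\quad \text{// since } D(\pi_{QID,SID,RK,TH}(R'_1)) = D(R'_1) \text{ by Rule 7, and } \pi_{QID,SID,RK,TH}(R'_1) = R_1 \\
&= \pi_{eName}(\sigma_{NN}(\pi_{EID,eName}(R'_1 \bowtie D(R_1)) \bowtie R'_2)) && \text{// standard equivalence}
\end{aligned}
$$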

The resulting expression corresponds to the following query plan: We first use an 
index-based algorithm to discover the nearest neighbor of s in the pattern set schoolSet 
(this corresponds to $D(R_1)$). We then find the events that use that particular school 
pattern (i.e., the first join $R'_1 \bowtie D(R_1)$). We then join the result with $R'_2$, the input 
relation of the second NN-UDP, on attributes EID and eName. Finally, we test the second 
NN-UDP on the join output and project out the attribute of interest, eName. 

Naturally, we can get another plan in a symmetrical way: 

$\pi_{eName}(\sigma_{NN}(\pi_{EID,eName}(R'_2 \bowtie D(R_2)) \bowtie R'_1))$. 

As another possibility, we may continue the above transformation as follows (picking 
up from the second to the last step): 

$$
\begin{aligned}
&\pi_{eName}(R''_1 \bowtie R''_2) \\
&= \pi_{eName}(\pi_{EID,eName}(R'_1 \bowtie D(R_1)) \bowtie \sigma_{NN}(R'_2)) \\
&= \pi_{eName}(\pi_{EID,eName}(R'_1 \bowtie D(R_1)) \bowtie \pi_{EID,eName}(R'_2 \bowtie D(R_2))) && \text{// same as done to } \sigma_{NN}(R'_1)
\end{aligned}
$$

The resulting expression corresponds to the following query plan: We first use an 
index-based algorithm to discover the nearest neighbor of s in the pattern set schoolSet, as 
well as the nearest neighbor of p in the pattern set pharmacySet. We then join the 
results with $R'_1$ and $R'_2$, respectively, to find the corresponding events, and finally obtain 
the common events by a join. 

In the above plans, we have only used Rules 1 and 7. For other queries, e.g., when pattern 
set IDs come from a relation instead of being constants, the other rules will be useful. 

3.3 Optimization Procedure 

From the example given in the beginning of the section, we can see that a good execu- 
tion plan is more likely to combine the use of both NN-UDPs and NN-OPs. Even though 
the equivalence rules of the previous subsection, together with the standard equivalence 
rules from the relational algebra, can be used to search through all the execution plans, 
it is obviously a very large space for an exhaustive search. 

Instead of using exhaustive search, we give an optimization algorithm, called the 
UdpOp algorithm. The major steps of the UdpOp algorithm are outlined in Figure 5. 



Step 1: For each subset of NN-UDPs, do the following: 

  - convert NN-UDPs in the subset to NN-OPs; 

  - push NN-OPs as close to leaves as possible; 

  - optimize with a traditional method by treating the output of NN-OPs as static relations. 

Step 2: For each plan found in Step 1, push join and selection into the NN-OPs, and re-evaluate 
  the costs. Find the least costly plan from all the resulting plans. 

Step 3: Find the execution plan for the NN-OPs in the plan obtained in Step 2. 

Fig. 5. Major steps of the UdpOp algorithm 

Step 1 of UdpOp is to select a subset of the NN-UDPs to be converted to NN-OPs. 
In this step, we enumerate all subsets of the NN-UDPs in a query. We argue this is not 
too much overhead since in real applications with similarity-based queries, the numbers 
of UDPs should not be too large. In special cases where there are many UDPs, heuristics 
may need to be adopted to reduce the search space. 

For each possible choice of converting NN-UDPs to NN-OPs, we push the NN-OPs as 
far down toward the leaves as possible. This is done for two reasons. The first is to give 
more flexibility to the traditional optimization algorithms (that deal with UDPs, i.e., 
all the remaining NN-UDPs). The second is that in an evaluation plan, the cost of an 
NN-OP does not usually depend on where it is performed (there are exceptions; see below). 
We then treat each NN-OP as a static relation and hand over the query to an optimizer. 
This optimizer will treat the remaining NN-UDPs as traditional UDPs. 

In this paper, we assume the optimization algorithm of [3] is used. Since [3] only 
deals with selection-projection-join (SPJ) queries, we will restrict our queries to SPJ 
queries as well. The algorithm of [3] treats UDPs as selections and assumes the selec- 
tivity and the per-tuple-cost for each UDP, including NN-UDP, are known. For specific 
applications, we need to develop corresponding methods to provide such estimates. 

In addition to the selectivity and per-tuple-cost of each UDP, we also need to know 
the size of each NN-OP since we treat it as a static relation. Furthermore, in the overall 
cost of a plan, the cost of deriving the result of the NN-OP should also be known. 
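
To make the role of these statistics concrete, the following sketch enumerates all subsets of UDPs, as in Step 1, under a deliberately simplified cost model. Everything about the model is an assumption made for illustration and is not the cost model of UdpOp or of [3]: join costs are ignored, a converted NN-OP is assumed to filter the input exactly as its predicate would (per Rule 1), predicates are assumed independent, and the remaining UDPs are ordered by the classic rank heuristic (selectivity - 1) / cost-per-tuple.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class UdpStats:
    name: str
    cost_per_tuple: float  # cost of testing the NN-UDP on one tuple (> 0)
    selectivity: float     # fraction of tuples that pass the NN-UDP
    op_cost: float         # cost of evaluating it as an NN-OP instead


def plan_cost(udps: Sequence[UdpStats],
              converted: Sequence[UdpStats],
              n_tuples: int) -> float:
    """Toy cost of a plan in which `converted` UDPs are run as NN-OPs."""
    cost = sum(u.op_cost for u in converted)
    rows = float(n_tuples)
    for u in converted:
        rows *= u.selectivity  # joining with D(R) filters like the predicate
    remaining = [u for u in udps if u not in converted]
    # Cheap, highly selective predicates first (ascending rank).
    remaining.sort(key=lambda u: (u.selectivity - 1.0) / u.cost_per_tuple)
    for u in remaining:
        cost += rows * u.cost_per_tuple
        rows *= u.selectivity
    return cost


def best_subset(udps: Sequence[UdpStats],
                n_tuples: int) -> Tuple[float, Tuple[UdpStats, ...]]:
    """Enumerate every subset of UDPs to convert and keep the cheapest."""
    best = None
    for k in range(len(udps) + 1):
        for subset in combinations(udps, k):
            c = plan_cost(udps, subset, n_tuples)
            if best is None or c < best[0]:
                best = (c, subset)
    return best
```

Under this toy model, a small Events relation favors keeping every predicate as an NN-UDP, while a large one favors converting the most selective predicate into an NN-OP, mirroring the discussion at the start of this section.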

Once the plan is obtained from Step 1, we try to push selections and joins into the 
NN-OP. The reason to do this is that sometimes, selection and join may reduce the 
number of query objects or even the pattern sets to be considered by the NN-OP. In 
some cases, the rank and threshold may also be reduced. For each plan obtained, we 
need to evaluate the overall cost. Step 2 will select the plan with the least cost. 

After Step 2, we are “committed” to the evaluation plan. However, there is still 
some chance to refine the overall plan. Since we are dealing with SPJ queries, if the 
output of an NN-OP is empty, then the overall query will be empty and we can stop 
processing any other NN-OPs and the rest of the evaluation plan. Step 3 is mainly for 
this purpose. We assume we have an estimate of how likely an NN-OP is to yield an empty 
result, in addition to its cost. This is of course related to the size of the NN-OP (i.e., 
empty means 0 tuples) but is a more specific statistic. 
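
One way such estimates could be used is sketched below. The ordering rule is an assumption for illustration, not necessarily what UdpOp does in Step 3: if the NN-OPs are independent and evaluation stops at the first empty output, evaluating them in ascending order of cost divided by probability of emptiness is a classical way to reduce the expected total cost.

```python
from typing import List, Tuple

# (name, cost of the NN-OP, estimated probability that its output is empty)
OpStat = Tuple[str, float, float]


def expected_cost(order: List[OpStat]) -> float:
    """Expected cost when evaluation stops at the first empty NN-OP output."""
    total, p_reach = 0.0, 1.0
    for _, cost, p_empty in order:
        total += p_reach * cost      # paid only if all earlier ops were non-empty
        p_reach *= 1.0 - p_empty
    return total


def order_ops(ops: List[OpStat]) -> List[OpStat]:
    """Cheap operators that are likely to be empty go first."""
    return sorted(
        ops,
        key=lambda op: op[1] / op[2] if op[2] > 0 else float("inf"),
    )
```

For instance, an NN-OP that costs slightly more but is very likely to return nothing is worth running first, because it will often make the remaining NN-OPs unnecessary.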

To conclude this section, we summarize the statistics that we need for the UdpOp al- 
gorithm: (1) cost per tuple of each NN-UDP, (2) selectivity of each NN-UDP, (3) cost of 
each NN-OP, (4) size of the output relation of each NN-OP, and (5) probability of having 
empty output of each NN-OP. For different applications, we need to develop different 
methods to obtain these statistics. In [7], we provided a method for some streaming time 
series cases. Also note, since similarity-based searches are typically very time consum- 
ing, we omit the cost of obtaining the statistics and running the optimization procedure 
when performance of query evaluation is assessed. 

4 Experimental Results 

To assess the effectiveness of our optimization algorithm, we implemented the algo- 
rithm proposed in this paper in C/C++. In addition, for similarity-based search with 
streaming time series, we developed a method to estimate the statistics used by the 
optimization algorithm (see [7] for detail). In this section, we present the results of 
performance evaluation with our UdpOp algorithm through three experiments. More 
detailed experimental results can be found in [6]. 

All experiments are performed on a desktop machine with a PIII 1.2GHz CPU and 512MB of 
memory, running Windows XP Professional. Since we use relative costs to evaluate 
the performance, the hardware environment does not make much difference. Because 
similarity-based queries are mostly CPU bound, in all experiments we load the data into 
memory in advance, and thus only the computational costs are measured. 

NN-UDP on streaming time series. A streaming time series is an infinite sequence of 
real numbers that continuously arrive at a query system. At each time position T, the 
streaming time series takes the form of a finite sequence, assuming the last real number 
is the one that arrived at the time position T. 

Fig. 6. Optimization comparison for Exp. 1 (x-axis: relative input size, % of pattern set) 



An NN-UDP (see Section 2.2) may use a streaming time series s as its query object 
(i.e., QID). In this case, we assume an implicit positive integer parameter w, and the 
NN-UDP is evaluated at each time position T by using the w-suffix of s up to (and 
including) time T. Hence, we assume each query containing this kind of NN-UDP needs 
to be evaluated at each time position T when a new value of s arrives. 
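
The per-position evaluation on the w-suffix can be sketched with a bounded buffer. This is a minimal illustration with assumed names (`evaluate_stream`, `query`); it also assumes that evaluation starts only once w values have arrived, which the text above does not specify.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def evaluate_stream(stream: Iterable[float],
                    w: int,
                    query: Callable[[List[float]], T]) -> Iterator[Tuple[int, T]]:
    """Re-evaluate `query` on the w-suffix each time a new value arrives."""
    window: Deque[float] = deque(maxlen=w)  # oldest value drops out automatically
    for position, value in enumerate(stream):
        window.append(value)
        if len(window) == w:
            yield position, query(list(window))
```

In the running example, `query` would wrap the evaluation of an SQL/sim query such as Example 1, with the current window standing in for the query object s.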

For our running example, the streaming time series are the school attendance stream 
and the pharmacy data stream, whose data are continuously collected and sent to a query 
system³. Queries like those in Examples 1 and 2 will be evaluated each time new 
school attendance and pharmacy data arrive. 

Data generation. We use synthetic time series data sets in the experiments. We generate 
two types of data: the pattern time series and the streaming time series. 

We first use a random-walk function to generate each pattern time series independently. 
The length of each pattern time series is between 50 and 300. Next, we construct 
each pattern set by randomly choosing 10 5 patterns. Finally, we construct the streaming 
time series. To form a streaming time series, we choose some patterns from one pattern 
set and then concatenate them into one sequence by interpolating a curve between two 
successive patterns. In choosing patterns from a pattern set, we follow some predefined 
probability distribution so that some patterns appear in the streaming time series more 
often than the others. To simulate real-world cases, we also add some white noise 
to each constructed streaming time series. 
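
The generation procedure can be sketched as follows. The details are assumptions chosen for illustration, since the text does not fix them: Gaussian steps for the random walk, linear interpolation with `gap` intermediate points between successive patterns, and Gaussian white noise with standard deviation `noise_std`.

```python
import random
from typing import List, Sequence


def random_walk(rng: random.Random, length: int) -> List[float]:
    """One pattern time series generated by a random walk."""
    value, series = 0.0, []
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)  # assumed step distribution
        series.append(value)
    return series


def make_stream(rng: random.Random,
                patterns: Sequence[Sequence[float]],
                weights: Sequence[float],
                n_patterns: int,
                gap: int,
                noise_std: float) -> List[float]:
    """Concatenate weighted-random patterns, interpolate gaps, add noise."""
    stream: List[float] = []
    for pattern in rng.choices(patterns, weights=weights, k=n_patterns):
        if stream:
            start, end = stream[-1], pattern[0]
            # Linear bridge between the previous pattern and the next one.
            for i in range(1, gap + 1):
                stream.append(start + (end - start) * i / (gap + 1))
        stream.extend(pattern)
    return [v + rng.gauss(0.0, noise_std) for v in stream]
```

A skewed `weights` vector makes some patterns appear in the stream more often than others, as described above.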

In the following experiments, we run each query multiple times and use the average 
cost to measure the performance of each optimization method. In each evaluation, we 
first build the estimation models for the statistics that will be used by the UdpOp algorithm, 
by running the query for 1000 time positions [7]. We then implement the different methods 
and run the queries for another 500 time positions (called one run). 

Experiment 1. In the first experiment, we use the query in Example 2. As discussed 
earlier, there are three possible ways to generate the query plan for this query: (1) NN-UDP: 
treating the UDP as an NN-UDP; (2) NN-OP: treating the UDP as an NN-OP; and (3) UdpOp: 
obtaining the plan using the UdpOp optimization algorithm. Note that the UDP may be treated 
as an NN-UDP or an NN-OP in (3), depending on the statistics. 

3 We assume these two streaming time series are synchronized. 

In the experiment, we vary the size of relation Events so that the number of tuples 
that are fed into the UDP varies from 10% to 100% of the size of pattern set schoolSet. 
For each case, we execute the query using each of the three plans and measure the cost. 
We also measure the cost of the “best plan”, which is obtained by trying all possible 
plans and choosing the least-costly one. 

The results are shown in Figure 6. Costs are normalized by the maximum cost for 
the respective input size; these maxima range from 5 ms (for the “10% of pattern set” case) 
to 120 ms (the “whole pattern set” case). We can see that both (1) and (2) may result in very 
costly plans in some cases while our UdpOp algorithm always achieves good plans. 

Experiment 2. In the second experiment, we change the schema of relation Events 
to represent more complicated rules. The new schema is (EID, eName, A1, A2, ..., An), 
where attribute Ai (1 ≤ i ≤ n) refers to patterns in pattern set seti. We then populate the 
Events table and consider the following query for streaming time series d1, d2, ..., dn. 



SELECT E.eName 
FROM Events E 
WHERE NN(‘d1’, ‘set1’, E.A1, K1, TH1) 
AND NN(‘d2’, ‘set2’, E.A2, K2, TH2) 
AND ... 
AND NN(‘dn’, ‘setn’, E.An, Kn, THn) 



We consider two independent pairs of stream and pattern set, and hence two UDPs. 
We set up the data so that, depending on the data in the Events table, the 
near/nearest-neighbor patterns discovered for one of the two streams may or may not help to narrow 
down the scope of the patterns to be considered for the other stream. More specifically, 
we generate multiple Events relations by using random subsets of the Cartesian Product 
of two sets (each having 10 5 patterns), one for each of the two Ai columns. The ranks 
in both UDPs are NULL and the thresholds are both set to 0.07. Two streaming time 
series are randomly chosen from a set of streams. 

There are three possible optimization methods for comparison with the UdpOp algorithm: 
(1) the 2-NN-UDP Method, in which both UDPs are NN-UDPs; (2) the 2-NN-OPs 
Method, in which both UDPs are converted into NN-OPs; and (3) the 1-UDP/1-NN-OP Method, 
in which we always convert one UDP into an NN-OP and keep the other as an NN-UDP. 

We run this experiment 300 times. For each run, we use one of the randomly gener- 
ated Events relations and the randomly chosen streaming time series and continuously 
evaluate the query for all 500 time positions. We take the average cost of these 300 runs 
as the performance measure for each optimization method. In addition, we also get the 
performance of the best plan, the least cost plan obtained by trying all possibilities. The 
result is in Figure 7, with all costs normalized by the maximum cost of 15 milliseconds. 
From this graph, it is clear that the performance of the UdpOp algorithm is very close 
to the best one. 

Fig. 7. Optimization comparison for Exp. 2 

Experiment 3. In this experiment, we use the same setup as for Experiment 2, i.e., 
we use the same schema of relation Events and the same SQL query. In contrast to 
Experiment 2, we test the scalability of our algorithm as the number of UDPs goes up 
in the following situation. All UDPs deal with the same pattern set but with a different 
stream, all thresholds are 0.07, and the rank of similarity is 1. Hence, we look for the 
pattern in the pattern set that is similar to all the streams. We vary the number of UDPs 
from 2 to 6, and set the size of Events to 20% of the pattern set. We generate the 
Events relation such that no value appears twice in the same column. This guarantees 
that if we find a nearest neighbor with the evaluation of one UDP (under one column), 
then there is only one tuple in relation Events that needs to be fed into the other UDPs. 
Clearly, in the optimized plan, the other UDPs should remain NN-UDPs since only one 
tuple needs to be verified, and the ordering of the UDPs and the choice of which UDP is 
converted into an NN-OP will determine the performance. 

Fig. 8. Optimization comparison for Exp. 3 (x-axis: number of UDPs, 2 to 6; plotted series: 
Optimization Method 1 (all as NN-UDPs), Optimization Method 2 (only 1 NN-OP), 
UdpOp Optimization (NN-OP/UDPs), and Best Plan (testing all possible plans)) 

Given the number of UDPs, we randomly pick the same number of streams. Two 
other optimization methods are used for comparison. The first one is to treat all UDPs 
as selections, i.e., none of the UDPs are converted to NN-OP. Then we use the algorithm 
proposed in paper [3] to obtain a plan. More specifically, we use the estimation models 
to find the rank, as defined in [3], for each UDP and then get the ordering among them. 
The second optimization method will always convert only one NN-UDP into NN-OP, 
which is the one estimated to have the least cost among all possibly converted NN-OPs. 

The performance comparison is shown in Figure 8, as the average cost of 210 runs 
for each number of UDPs. Again, we normalized all costs by a maximum value that is 
roughly 15 milliseconds. We can see that the performance of the plans obtained by the 
UdpOp algorithm is close to the best one. 



5 Conclusion 

In this paper, we introduced a user-defined predicate (UDP) for expressing queries in- 
volving similarity searches. We provided an optimization algorithm to derive efficient 
evaluation plans. In the (streaming) time series cases, our experiments demonstrated the 
good performance of our optimization algorithm. 

We mainly focused on the situation where there are only a few query objects (e.g., 
time series and streaming time series). In our examples, we mostly used constants to 
represent them. We believe in real applications, this is mostly the case. However, for 
applications where the query series are massive, there are opportunities to further optimize 
the queries. For example, when NN-OP is applied to many combinations of query 
series and pattern sets, many combinations may not find any pattern within the required 
threshold. This finding can be useful to optimize other NN-OPs in the same query. The same 
observation applies to the number of pattern sets. 

Abstract. XML is rapidly emerging as a dominant standard for representing 
and exchanging information. The ability to transform and present 
data in XML is crucial and XSLT is a relatively recent programming 
language, specially designed to support this activity. Despite its utility, 
however, XSLT is widely considered a difficult language to learn. 

In this paper, we present XSLTGen: An Automatic XSLT Generator, a 
novel system that automatically generates an XSLT stylesheet, given a 
source XML document and a desired output HTML or XML document. 
It allows users to become familiar with and learn XSLT, based solely on 
their knowledge of XML or HTML. Our method is based on the use of 
semantic mappings between the input and output documents. We show 
how such mappings can be first discovered and then employed to create 
XSLT stylesheets. The results of our experiments show that XSLTGen 
works well with different varieties of XML and HTML documents. 



1 Introduction 

XML is rapidly emerging as the new standard for data representation and ex- 
change on the Web. As the medium for communication between applications, an 
ability to transform XML to other data representations is essential. This data 
conversion can be performed by XSLT (Extensible Stylesheet Language 
Transformations) [3]. XSLT plays an important role in transforming XML into HTML, 
text, or other types of XML¹. However, XSLT is a relatively new language, which 
is widely considered difficult to learn [8]. Rendering to HTML using XSLT re- 
quires skills of XSLT programming as well as Web page styling. Hence, we focus 
on developing a tool that can automatically generate XSLT stylesheets, given a 
source XML document and a target HTML document, provided by the users. 

Automatic XSLT generation is an extremely useful facility for students and 
Web developers in the process of learning XSLT. Such a tool enables them to see 
and understand how the XSLT stylesheet should look, in order to transform a 
particular XML document into a desired HTML document. This tool is also useful 
in the XSLT development process. Programmers may use the automatically 
generated XSLT stylesheet as a starting point for something more complex. 

1 This paper focuses on XML to HTML transformations since we are motivated by 
publishing applications, but our techniques are also applicable for XML to XML 
transformations. 

In this paper, we present XSLTGen: An Automatic XSLT Generator, a 
novel system that automatically generates an XSLT stylesheet, given a source 
XML document, and a desired output HTML document. The generated XSLT 
stylesheet contains rules for transforming the given XML document to the HTML 
document, and can be applied to other XML documents with similar struc- 
ture. The important feature of this system is that users can generate an XSLT 
stylesheet based solely on their knowledge of XML and HTML, i.e. users only 
need to create a desired output HTML document based on an input XML doc- 
ument. Moreover, users do not have to know anything about the syntax or pro- 
gramming of XSLT, or be aware of the XSLT rule generation process. 

A naive solution to the problem of automatic XSLT generation is to create 
an XSLT stylesheet consisting of only one template rule, which matches the 
XML root element and contains the HTML document markup (i.e., create a 
stylesheet which is very specific to the desired output). This approach has a 
major drawback in terms of reusability, since this kind of stylesheet could not be 
used to transform other XML documents having similar structure as the input 
XML document. In contrast, we are interested in generating a more generic 
stylesheet, which can then be reused to transform other XML documents with 
similar structure. 

This paper shows how XSLT stylesheets can be generated via semantic map- 
pings between the input and output. Our contributions are: 

— We describe text matching and structure matching, techniques for finding 
semantic mappings between an XML document and an HTML document. 

— We introduce sequence checking to the matching context, which enables our 
system to not only discover 1-1 mappings, but also 1-m mappings. 

— We describe a fully automatic XSLT generation system that generates XSLT 
rules based on the semantic mappings found. 

— We describe a technique for refining the XSLT stylesheet generated, which 
examines the differences between the original HTML document and the one 
produced by applying the generated stylesheet back to the XML document. 

— We conduct experiments to validate the matching accuracy and the quality of 
XSLT stylesheets generated by XSLTGen. The results show that XSLTGen 
works well with different varieties of XML and HTML documents. 

— This is the first paper that we are aware of that describes completely auto- 
matic XSLT generation from an XML source and an HTML destination. 



2 XSLTGen System 

We now present an overview of the XSLTGen system. Due to lack of space, 
we describe the necessary algorithms informally. We use Soccer² as our running 
example. Before that, we introduce important definitions and terminology used. 

2 http://www.wrox.com/books/0764543814.shtml 

2.1 Definitions and Terminology 

We assume familiarity with XSLT language. Both XSLT Version 1.0 Specifica- 
tion [3], and XSLT Programmer’s Reference [6] provide good background. 

Let m : (m.p, m.x, m.h) denote a mapping from an element in the XML document 
(source) to one or more elements in the HTML document (destination). 
The XML component of m is denoted by m.x, while the HTML component of 
m is denoted by m.h. If m.x is an ATTRIBUTE_NODE, we define the term owner 
node, denoted by m.p, to be the XML node that owns m.x, otherwise it is null. 

Two mappings m1 and m2 are distinct if m1.x.name ≠ m2.x.name and
m1.h.name ≠ m2.h.name (where name refers to the tag name of the element).

An exact mapping is a mapping e, where e.x is a TEXT_NODE or an
ATTRIBUTE_NODE, e.h is a TEXT_NODE, and the text value of e.x is identical to the
text value of e.h, i.e. e.x.value = e.h.value.

A substring mapping is a mapping s, where s.x is a TEXT_NODE or an
ATTRIBUTE_NODE, s.h is a TEXT_NODE, s.x.value is a substring of s.h.value, and

— if s.x.value starts with a non-letter and non-digit character, then the
character preceding its occurrence in s.h.value (if any) must be either a letter
or a digit; otherwise, it must be both non-letter and non-digit. And,

— if s.x.value ends with a non-letter and non-digit character, then the character
following its occurrence in s.h.value (if any) must be either a letter or a digit;
otherwise, the following character must be both non-letter and non-digit.
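
The two boundary conditions above can be illustrated with a small sketch. It is
not the authors' implementation: the function names are hypothetical, and it
assumes that "letter or digit" can be approximated by Python's str.isalnum().

```python
def boundary_ok(edge_char, neighbour):
    """Check one boundary of an occurrence of s.x.value inside s.h.value.

    edge_char is the first (or last) character of the XML value; neighbour is
    the HTML character just before (or after) the occurrence, or None if the
    occurrence touches the start (or end) of the HTML string.
    """
    if neighbour is None:
        return True  # the condition only applies "if any" neighbour exists
    if edge_char.isalnum():
        # XML value starts/ends with a letter or digit:
        # the neighbour must be neither a letter nor a digit.
        return not neighbour.isalnum()
    # XML value starts/ends with a non-letter, non-digit character:
    # the neighbour must be a letter or a digit.
    return neighbour.isalnum()


def is_substring_mapping(x_value, h_value):
    """Return True if some occurrence of x_value in h_value meets both conditions."""
    if not x_value:
        return False
    start = h_value.find(x_value)
    while start != -1:
        end = start + len(x_value)
        before = h_value[start - 1] if start > 0 else None
        after = h_value[end] if end < len(h_value) else None
        if boundary_ok(x_value[0], before) and boundary_ok(x_value[-1], after):
            return True
        start = h_value.find(x_value, start + 1)
    return False
```

Under these assumptions, "A" is accepted against "Matches in Group A", while
"cot" is rejected against "Scotland" because it is embedded inside a word.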

Special HTML elements are used to separate text in HTML, e.g. br and hr. 
An extra node is a node that does not have any matching node in the other 
document. Extra XML nodes are those XML nodes that are ignored when gen- 
erating the HTML document, while extra HTML nodes are those HTML nodes 
that are added at the time the HTML document was generated and are not 
constructed from any part of the XML document. 

Let N be an XML or HTML node. The precise node of N, precise(N), is
the node used to represent a transformation in the XSLT template rule. For
a mapping m, if m.x is a TEXT_NODE, precise(m.x) is the parent of m.x and
precise(m.p) is null, whereas if m.x is an ATTRIBUTE_NODE, precise(m.x) = m.x
and precise(m.p) = m.p. The precise node for m.h is:

— If the next sibling of m.h is a special HTML ELEMENT_NODE h_s, then
precise(m.h) = m.h ++ h_s

— If m.h has no next sibling or the next sibling of m.h is not special and m.h
has ELEMENT_NODE siblings, then precise(m.h) = m.h

— Otherwise, precise(m.h) is the highest ancestor of m.h such that each node
on the path between precise(m.h) and m.h only has one non-extra child.

The sequence of N, seq(N), is a DTD of N if N is the root of a document.
Otherwise, let D' be a DTD of the parent of N. seq(N) is then equal to
trim_N(D'), where trim_N is a function which removes both the largest prefix not
containing N and the largest suffix not containing N from its argument.

2.2 System Architecture

The architecture of the XSLTGen system is illustrated in Fig. 1. Two documents, 
a source XML document and a desired target HTML document, are given to 
XSLTGen in order to initiate the stylesheet generation process. The output of 
XSLTGen is an XSLT stylesheet consisting of rules for transforming the given 
XML document to the supplied HTML document. As shown in the figure, the 
system consists of six main components described next. Fig. 2 shows an example 
of fragments of XML DOM and HTML DOM for the Soccer example. 




Fig. 1. Architecture of the XSLTGen system 




Text Matching. The goal of the text matching subsystem is to discover a set of 
exact mappings and substring mappings. As the content of the HTML document 
is created based on the content of the XML document, it is important to find 
both exact and substring mappings between the two documents, since there must 
be HTML elements that have the same string or substring as the XML elements. 

The starting point to find a mapping is to compare nodes having a value
attribute, i.e. TEXT_NODEs and ATTRIBUTE_NODEs. In this paper, we only compare
an XML TEXT_NODE with an HTML TEXT_NODE and an XML ATTRIBUTE_NODE
with an HTML TEXT_NODE. We ignore HTML ATTRIBUTE_NODEs since the attribute
value of those nodes is usually specific to the display of the HTML
document in Web browsers and is not generated from the text within the
XML document. In the Soccer example of Fig. 2,

— Exact mapping occurs between XML TEXT_NODE “10-Jun-98” and HTML
TEXT_NODE “10-Jun-98”, because their values are the same.

— Substring mapping occurs between XML TEXT_NODE “A” and HTML
TEXT_NODE “Matches in Group A”, since only part of the value of the HTML
TEXT_NODE, i.e. “A”, is identical to the value of the XML TEXT_NODE.

As mentioned earlier, text matching takes two inputs: an XML DOM x and 
an HTML DOM h. It discovers as many text mappings as possible between the 
nodes in x and h. The text matching procedure is called twice, once to discover 
all EXACT mappings between the nodes in x and h and again to discover all 
SUBSTRING mappings. The output of our text matching algorithm is a list of
EXACT mappings (M_E) and a list of SUBSTRING mappings (M_S).

Our text matching process is implemented using a top-down approach by vis- 
iting each node in the XML DOM in pre-order and using the same traversal to 
find a matching node in the HTML DOM. Note that in order to create an XSLT 
template rule, the XML node should be an ELEMENT_NODE (not a text node). 
Therefore, we need to determine for each exact and substring mapping found, the 
precise node of its XML and HTML component that best describes the transfor- 
mation. This approach allows us to discover more precise mappings between the 
XML and HTML documents. Moreover, we require that an HTML node that has been
matched to an XML node during the exact matching process is not considered as a
matching candidate in the substring matching process.
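
The two-pass procedure described above can be sketched as follows. This is an
illustration only, under several assumptions: nodes are represented as
hypothetical (identifier, text value) pairs rather than DOM nodes, the substring
test is a plain containment check without the boundary conditions of Section 2.1,
and leading and trailing whitespace is stripped before comparison.

```python
def text_match(xml_nodes, html_nodes):
    """Two-pass text matching over (node_id, value) pairs.

    Pass 1 collects exact mappings. Pass 2 collects substring mappings, skipping
    every HTML node that was already matched exactly. XML nodes may take part in
    both passes.
    """
    exact, substring = [], []
    matched_html = set()

    # Pass 1: exact mappings (identical text values).
    for x_id, x_val in xml_nodes:
        for h_id, h_val in html_nodes:
            if x_val.strip() == h_val.strip():
                exact.append((x_id, h_id))
                matched_html.add(h_id)

    # Pass 2: substring mappings, excluding exactly matched HTML nodes.
    for x_id, x_val in xml_nodes:
        x = x_val.strip()
        for h_id, h_val in html_nodes:
            if h_id in matched_html:
                continue
            h = h_val.strip()
            if x and x != h and x in h:
                substring.append((x_id, h_id))

    return exact, substring


# Hypothetical flattening of the Soccer fragment into (id, value) pairs.
xml_nodes = [
    ("soccer/@group", "A"),
    ("match/date", "10-Jun-98"),
    ("match/team[1]", "Brazil"),
    ("match/team[2]", "Scotland"),
]
html_nodes = [
    ("h1", "Matches in Group A"),
    ("h2", "Brazil vs Scotland"),
    ("td[1]", "10-Jun-98"),
    ("td[2]", "Brazil"),
    ("td[3]", "Scotland"),
]
exact_mappings, substring_mappings = text_match(xml_nodes, html_nodes)
```

Running the exact pass first keeps the td cells out of the substring pass, so
each team value is paired only with the h2 heading in the second pass.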

In Fig. 2, some of the text mappings are: (<soccer>...</soccer>, group="A",
<h1>Matches in Group A</h1>); (null, <team>Brazil</team>, <td>Brazil</td>); and
(null, <team>Brazil</team>, <h2>Brazil vs Scotland</h2>).



Structure Matching. This subsystem discovers all structure mappings between
elements in the XML and HTML DOMs. We adopt two constraints used
in the GLUE system [4] as a guide to determine whether two nodes are structurally
matched:

— Neighbourhood Constraint: “two nodes match if nodes in their neighbourhood
also match”, where the neighbourhood is defined to be the children.

— Union Constraint: “if every child of node A matches node B, then node A
also matches node B”.

Note that there could be a range of possible matching cases, depending on 
the completeness and precision of the match. In the ideal case, all components 
of the structures in the two nodes fully match. Alternatively, only some of the 
components are matched (a partial structural match). In the case of partial 
structure matching between two nodes, there are some extra nodes, i.e. children 
of the first node that do not match with any children of the second node; and/or 
vice versa. Since extra nodes do not have any match in the other document, they 
are ignored in the structure matching process. Therefore, the above constraints 
need to be modified to construct the definition of structure matching which 
accommodates both partial and full structure matching: 



— Neighbourhood Constraint: “XML node X structurally matches HTML node
H if H is not an extra HTML node and every non-extra child of H either
text matches or structurally matches a non-extra descendant of X”.

— Union Constraint: “X structurally matches H if every non-extra child of H
either text matches or structurally matches X”.

As stated in the above constraints, we need to examine the children of the 
two nodes being compared in order to determine if a structure matching exists. 
Therefore, structure matching is implemented using a bottom-up approach that 
visits each node in the HTML DOM in post-order and searches for a matching 
node in the entire XML DOM. If the list of structure mappings M_SM is still
empty after the structure matching process finishes, we add a mapping from the
XML root element to the HTML body element, if it exists, or to the HTML
root element otherwise. Revisiting the Soccer example (Fig. 2), some of the
discovered structure mappings are: (null, match, tr) (neighbourhood constraint), 
(null, match, table) (union constraint) and (null, soccer, body). 

Sequence Checking. Up to this point, the mappings generated by the text 
matching and structure matching subsystems are limited to 1-1 mappings. In 
cases where the XML and HTML documents have more complex structure, these 
mappings may not be accurate and this can affect the quality of the XSLT rules 
generated from these mappings. Consider the following example: 

In Fig. 2, the sequence of the children of XML node soccer is made up of 
nodes with the same name, match; whereas the sequence of the children of the 
matching HTML node body follows a specific pattern: it starts with h1 and is
followed repetitively by h2 and table. Using only the discovered 1-1 mappings, 
it is not possible to create an XSLT rule for soccer that resembles this pattern, 
since match maps only to table according to structure matching. In other words, 
there will be no template that will generate the HTML node h2. 

Focusing on the structure mapping (match, table) and the substring mappings
{(team[1], h2), (team[2], h2)}, we can see in the DOM trees that the children of
match, i.e. team[1] and team[2], are not mapped to descendants of table.
Instead, they map to the sibling of table, i.e. h2. Normally, we expect that the
descendants of match map only to descendants of table, so that the notion of
1-1 mapping is kept. In this case, there is an intuition that match should not only
map to table, but also to h2. In fact, match should map to the concatenation 
of nodes h2 and table, so that the sequence of the children of body is preserved 
when generating the XSLT rule. This is called a 1-m mapping, where an XML 
node maps to the concatenation of several HTML nodes. 

The 1-m mapping (match, h2 ++ table) can be found by examining the subelement
sequence of soccer and the subelement sequence of body described above.
Note that the subelement sequence of a node can be represented using a regular
expression, which is a combination of symbols representing each subelement and
metacharacters: |, *, +, ?, (, ). To obtain this regular expression, Xtract [5], a
system for inferring a DTD of an element, is employed. In our example, the regular
expression representing the subelement sequence of soccer is match*, whereas
the one representing the subelement sequence of body is h1(h2, table)*. We then check
whether the elements in the first sequence conform to the elements in the second
sequence, as follows. According to the substring mapping (soccer, group, h1), element
h1 conforms to an attribute of soccer and thus we ignore it and remove h1
from the second sequence. Comparing match* with (h2, table)*, we can see that
element match should conform to elements (h2, table) since the sequence match*
corresponds directly to the sequence (h2, table)*, i.e. they both are in a repetitive
pattern, denoted by *. However, element match conforms only to element table,
as indicated by the structure matching (match, table). The verification therefore
fails, which indicates that the structure matching (match, table) is not accurate.
Consequently, based on the sequences match* and (h2, table)*, we deduce the
accurate 1-m mapping: (null, match, h2 ++ table).

The main objective of the sequence checking subsystem is to discover 1-m 
mappings using the technique of comparing two sequences described above. 
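
The comparison of the two sequences can be sketched in a much simplified form.
The code below is an assumption-laden illustration, not the technique used by
XSLTGen: instead of inferring a DTD with Xtract, it works on the concrete lists
of child names and finds the smallest repeating unit of each list. All function
and parameter names are hypothetical.

```python
def repeating_unit(seq):
    """Return the shortest prefix whose repetition yields seq (seq itself if none)."""
    n = len(seq)
    for size in range(1, n + 1):
        if n % size == 0 and seq == seq[:size] * (n // size):
            return seq[:size]
    return seq


def find_one_to_many(xml_children, html_children, consumed_html, structure_mappings):
    """Propose a 1-m mapping (xml_name, [html_names]) or return None.

    consumed_html holds HTML child names already explained by another mapping
    (h1 in the Soccer example, which comes from an attribute of soccer).
    structure_mappings is a list of (xml_name, html_name) 1-1 mappings.
    """
    remaining = [h for h in html_children if h not in consumed_html]
    x_unit = repeating_unit(xml_children)
    h_unit = repeating_unit(remaining)
    # A single repeated XML element facing a repeated group of HTML elements.
    if len(x_unit) == 1 and len(h_unit) > 1:
        x_name = x_unit[0]
        # The existing 1-1 mapping covers only part of the HTML group,
        # so it is widened to the whole group.
        if any(sx == x_name and sh in h_unit for sx, sh in structure_mappings):
            return (x_name, h_unit)
    return None


one_to_many = find_one_to_many(
    ["match", "match"],
    ["h1", "h2", "table", "h2", "table"],
    {"h1"},
    [("match", "table")],
)
```

For the Soccer sequences this yields the pair match and [h2, table], which
corresponds to the 1-m mapping (null, match, h2 ++ table) in the text.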

XSLT Stylesheet Generation. This subsystem constructs a template rule for
each mapping discovered in M_E, M_SM, and M_OM (the list of 1-m mappings),
and puts them together to compose an XSLT stylesheet. We do not consider
the substring mappings in M_S, because in substring mappings, it is possible to
have a situation where the text value of the HTML node is a concatenation
of text values from two or more XML nodes. Hence, it is impossible to create
template rules for those XML nodes. Moreover, the HTML text value may contain
substrings that do not have a matching XML text value (termed extra
strings). Considering these situations, we implement a procedure that generates
a template for each distinct HTML node in M_S.

The XSLT stylesheet generation process begins by generating the list of substring
rules. We then construct a stylesheet by creating the <xsl:stylesheet>
root element and subsequently filling it with template rules for the 1-m mappings
in M_OM, the structure mappings in M_SM and the exact mappings in M_E.
The template rules for the 1-m mappings have to be constructed first, since
within that process, they may invalidate several mappings in M_SM and M_E and
thus the template rules for those omitted mappings do not get used. In each
mapping list (M_OM, M_SM, and M_E), the template rule is constructed for each
distinct mapping, to avoid having conflicting template rules.

In the next three subsections, we give more detail on the XSLT generation 
process. Discussion of 1-m mappings is left out due to space constraints. 

Substring Rule Generation. The substring rule generator creates a template from
an XML node or a set of XML nodes to each distinct HTML node present in
the substring mapping list M_S. The result of this subsystem is a list of substring
rules SUB_RULES, where each element of the list is a tuple (html_node, rule).
Due to space constraints, we omit the detailed description of our algorithm 
for generating the substring rule itself. The following example illustrates how 
substring rule generation works. 

Consider the following substring mappings discovered in the Soccer example
(Fig. 2): (null, <team>Scotland</team>, <h2>Brazil vs Scotland</h2>) and (null,
<team>Brazil</team>, <h2>Brazil vs Scotland</h2>). The HTML string is
“Brazil vs Scotland”, while the set of XML strings is {“Brazil”, “Scotland”}.
By replacing parts of the HTML string that appear in the set of XML strings
with the corresponding XSLT instruction, the substring rule is:

<xsl:value-of select="team[1]"/> vs <xsl:value-of select="team[2]"/>,
where “vs” is an extra string.
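
Since the detailed algorithm is omitted, the following is only a guess at how
the replacement step in this example could work. The function name and the
(select expression, text value) input format are assumptions made for
illustration; each XML string is replaced at its first occurrence and extra
strings are escaped and kept literally.

```python
from xml.sax.saxutils import escape


def substring_rule(html_text, xml_strings):
    """Build the body of a substring rule for one HTML text value.

    xml_strings is a list of (select_expression, text_value) pairs. Text that
    is not covered by any XML string is kept as an extra string.
    """
    # Locate the first occurrence of every XML string in the HTML text.
    hits = []
    for select, value in xml_strings:
        pos = html_text.find(value) if value else -1
        if pos != -1:
            hits.append((pos, value, select))
    hits.sort()

    parts, cursor = [], 0
    for pos, value, select in hits:
        if pos < cursor:
            continue  # overlaps a replacement that was already emitted
        parts.append(escape(html_text[cursor:pos]))  # extra string, if any
        parts.append('<xsl:value-of select="%s"/>' % select)
        cursor = pos + len(value)
    parts.append(escape(html_text[cursor:]))
    return "".join(parts)


rule = substring_rule(
    "Brazil vs Scotland",
    [("team[1]", "Brazil"), ("team[2]", "Scotland")],
)
```

With the two Soccer mappings as input, the sketch reproduces the rule shown
above, with " vs " preserved as the extra string between the two instructions.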

Constructing a Template Rule for an Exact Mapping in M_E. Each template rule
begins with an XSLT <xsl:template> element and ends by closing that element.
For a mapping m, the pattern of the corresponding template rule is m.x.name
and the only XSLT instruction used in the template is <xsl:value-of>. In this
procedure, we only construct a template rule when m is a mapping from an
XML ELEMENT_NODE to an HTML node or a concatenation of HTML nodes.
The reason that we ignore mappings involving XML ATTRIBUTE_NODEs is that
the template for such a mapping will be generated directly within the construction
of the template rule for structure mapping and 1-m mapping. In text matching,
there could be mappings from an XML node to a concatenation of HTML nodes;
hence, we need to create a template for each HTML node h_i in m.h. E.g., the
template rule for an exact mapping (null, line, text() ++ br) is:

<xsl:template match="line">
  <xsl:value-of select="."/><br/>
</xsl:template>

Constructing a Template Rule for a Structure Mapping in M_SM. Recall that in
structure matching, one of the mappings in M_SM must be the mapping whose
XML component is the root of the XML document. Let r denote this special
mapping. The template for r begins with copying the root of the HTML document
and its subtree, excluding the HTML component r.h and its subtree.

The next step in constructing the template for mapping r follows the steps
performed for the other mappings in M_SM. For any mapping m in M_SM, the
opening tag for m.h is created, then a template for each child c_j of m.h is created,
and finally, the m.h tag is closed. E.g., suppose there is a structure mapping
(null, match, table) discovered in Soccer (Fig. 2), and suppose we have exact
mappings (null, date, td[1]), (null, team[1], td[2]), (null, team[2], td[3]) in M_E.
The template rule representing this structure mapping is:

<xsl:template match="match">
  <tr><xsl:apply-templates select="./date"/>
  <xsl:apply-templates select="./team[1]"/>
  <xsl:apply-templates select="./team[2]"/></tr>
</xsl:template>
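
The two template constructions above can be mimicked with simple string
building. This is a sketch under stated assumptions rather than the XSLTGen
generator: the helper names are hypothetical, HTML components are given as
plain tag names with "text()" standing for the text value, and the tr tag
follows the printed rule.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def exact_template(xml_name, html_parts):
    """Template for an exact mapping; "text()" becomes <xsl:value-of select="."/>."""
    body = "".join(
        '<xsl:value-of select="."/>' if part == "text()" else "<%s/>" % part
        for part in html_parts
    )
    return '<xsl:template match="%s">%s</xsl:template>' % (xml_name, body)


def structure_template(xml_name, html_name, child_selects):
    """Template for a structure mapping: open tag, one call per child, close tag."""
    calls = "".join(
        '<xsl:apply-templates select="./%s"/>' % child for child in child_selects
    )
    return '<xsl:template match="%s"><%s>%s</%s></xsl:template>' % (
        xml_name, html_name, calls, html_name)


def stylesheet(templates):
    """Wrap template rules in an <xsl:stylesheet> root element."""
    return ('<xsl:stylesheet version="1.0" xmlns:xsl="%s">' % XSL_NS
            + "".join(templates) + "</xsl:stylesheet>")


line_rule = exact_template("line", ["text()", "br"])
match_rule = structure_template("match", "tr", ["date", "team[1]", "team[2]"])
sheet = stylesheet([match_rule, line_rule])
# Parsing only checks that the generated text is well-formed XML;
# it does not run the transformation.
root = ET.fromstring(sheet)
```

Building the rules as strings keeps the sketch short; a real generator would
more likely build a DOM so that escaping and nesting are handled by the library.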

Refining the XSLT Stylesheet. In some cases, the (new) HTML document 
obtained by applying the generated XSLT stylesheet to the XML document may 
not be accurate, i.e. there are differences between this (new) HTML document 
and the original (user-defined) HTML document. By examining those differences,
we can improve the accuracy of the XSLT stylesheets generated. This
step is applicable when we have a set of complete and accurate mappings between
the XML and HTML documents, but the generated XSLT stylesheet is
erroneous. If the discovered mappings themselves are incorrect or incomplete, 
then this refinement step is not effective and it is better to address the problem 
by improving the matching techniques. An indicator that we have complete and 
accurate mappings is that each element in the new HTML document corresponds 
exactly to the element in the original HTML document at the same depth. 

One possible factor that can cause the generated XSLT stylesheet to be
inaccurate is the wrong ordering of XSLT instructions within a template. This
situation typically occurs when we have XML nodes with the same name but 
different order or sequence of children. Therefore, the main objective of the 
refinement step is to fix the order of the XSLT instructions within the template 
matches of the generated XSLT stylesheet, so that the resulting HTML document 
is closer to or exactly the same as the original HTML document. 

A naive approach to the above problem is to use brute force and attempt all 
possible orderings of instructions within templates until the correct one is found 
(there exist no differences between the new HTML and the original HTML). 
However, this approach is prohibitively costly. Therefore, we adopt a heuristic 
approach, which begins by examining the differences between the original HTML 
document and the one produced by the generated XSLT stylesheet. We employ 
a change-detection algorithm [2], that produces a sequence of edit operations 
needed to transform the original HTML document to the new HTML document. 
The types of edit operations returned are insert, delete, change, and move. 

To carry out the refinement, we focus on the move operations,
since we want to swap around the XSLT instructions in a template
match to get the correct order. For this to work, we require that there are
no missing XSLT instructions for any template match in the XSLT stylesheet.

After examining all move operations, this procedure is started over using the 
fixed XSLT stylesheet. This repetition is stopped when no move operations are 
found in one iteration; or, the number of move operations found in one iteration 
is greater than those found in the previous iteration. The second condition is 
there to prevent the possibility of fixing the stylesheet incorrectly. We want the 
number of move operations to decrease in each iteration until it reaches zero. 
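
The stopping rule of this loop can be written down as a short sketch. The
callbacks count_moves and fix_moves are hypothetical stand-ins for the
change-detection step and the instruction reordering step, and the iteration
cap is an added safeguard that is not part of the description above.

```python
def refine(stylesheet, count_moves, fix_moves, max_iterations=50):
    """Repeat the refinement until one of the two stopping conditions holds.

    count_moves(stylesheet) returns the number of move operations reported when
    the regenerated HTML is compared with the original HTML.
    fix_moves(stylesheet) returns a stylesheet with instructions reordered.
    """
    previous = None
    for _ in range(max_iterations):
        moves = count_moves(stylesheet)
        if moves == 0:
            break  # condition 1: no move operations were found
        if previous is not None and moves > previous:
            # Condition 2: the move count grew. The text does not say whether
            # the last change should be undone, so it is simply kept here.
            break
        stylesheet = fix_moves(stylesheet)
        previous = moves
    return stylesheet
```

A caller would supply the two callbacks, for example by wrapping an XSLT
processor and a tree differencing tool.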



3 Empirical Evaluation 

We have conducted experiments to study and measure the performance of XSLTGen.
To give the reader some idea of how our system performs, we evaluated
XSLTGen on four examples taken from a popular XSLT book 3 and one real-life
dataset taken from MSN Messenger chat history. These datasets exhibit a wide
variety of characteristics, ranging in size from 10 to 244 element nodes. Originally, they
were pairs of (XML document, XSLT stylesheet). To get the HTML document
associated with each dataset, we applied the original XSLT stylesheet to the XML
document using the Xalan 4 XSLT processor. We then manually determined the correct
mappings between the XML and HTML DOMs in each dataset.

3 http://www.wrox.com/books/0764543814.shtml

For each dataset, we applied XSLTGen to find the mappings between the 
elements in the XML and HTML DOMs, and generate the corresponding XSLT 
stylesheet. We then measured the matching accuracy, i.e. the percentage of the
manually determined mappings that XSLTGen discovered, and the quality of
the XSLT stylesheet inferred by XSLTGen. To evaluate the quality of the XSLT
stylesheet generated by XSLTGen in each dataset, we applied the generated
XSLT stylesheet back to the XML document using Xalan and then compared the
resulting HTML with the original HTML document using HTMLDiff 5. HTMLDiff
is a tool for analysing changes made between two revisions of the same file.
It is commonly used for analysing HTML and XML documents. The differences 
may be viewed visually in a browser, or be analysed at the source level. 
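
The matching accuracy measure defined above amounts to a simple ratio. The
sketch below assumes mappings are represented as hashable tuples; the function
name is hypothetical, and returning None when no manual mappings exist is a
choice made for this illustration.

```python
def matching_accuracy(discovered, expected):
    """Percentage of manually determined mappings that were discovered.

    Returns None when there are no manually determined mappings of this kind,
    since a percentage is undefined in that case.
    """
    expected = set(expected)
    if not expected:
        return None
    found = expected & set(discovered)
    return 100.0 * len(found) / len(expected)
```

Mappings that were discovered but are absent from the manual list do not lower
this figure, since it measures only how much of the manual list was recovered.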

XSLTGen achieves high matching accuracy across all five datasets. Exact
mappings reach 100% accuracy in four out of five datasets. In the dataset Chat Log,
exact mappings reach 86% accuracy. This is caused by the undiscovered mappings
from XML ATTRIBUTE_NODEs to HTML ATTRIBUTE_NODEs, which violate our assumption
in Section 2.2 that the value of an HTML ATTRIBUTE_NODE is usually specific to the
display of the HTML document in Web browsers and is not generated from
text within the XML document. Substring mappings achieve 100% accuracy in
the datasets Itinerary and Soccer. In contrast, substring mappings achieve 0%
accuracy in the dataset Poem. This poor performance is caused by incorrectly
classifying substring mappings as exact mappings during the text matching process.
In the datasets Books and Chat Log, substring mappings do not exist.
Structure mappings achieve perfect accuracy in all datasets except Poem. In
the dataset Poem, structure mappings achieve 80% accuracy because an XML
node is incorrectly matched with an HTML TEXT_NODE in text matching, while
it should be matched with another HTML node in structure matching. Following
the success of the other mappings, 1-m mappings achieve 100% accuracy in
the datasets Itinerary and Soccer. In the datasets Books, Poem and Chat Log,
there are no 1-m mappings. These results indicate that in most of these cases, the
XSLTGen system is capable of discovering complete and accurate mappings.

According to HTMLDiff, the new HTML documents have a very high percentage of
correct nodes. In the datasets Itinerary and Soccer, the HTML documents being
compared are identical, which is shown by the achievement of 100% in all types
of nodes. In the dataset Poem, the two HTML documents have exactly the same
appearance in Web browsers, but according to HTMLDiff, there are some missing
whitespaces in each line within the paragraphs of the new HTML document. That
is why the percentage of correct TEXT_NODEs in this dataset is very low (14%).
The reason for this low percentage is that in the text matching subsystem, we
remove the leading and trailing whitespaces of a string before the matching is
done. The refinement stage does not fix the stylesheet since there are no move
operations. In the dataset Books, the difference occurs in the first column of
the table. In the original HTML document, the first column is a sequence of
numbers 1, 2, 3 and 4, whereas in the new HTML document, the first column is a
sequence of 1s. The numbers 1, 2, 3 and 4 in the original HTML document are
represented using four extra nodes. However, our template rule constructor
assumes that all extra nodes that are cousins (their parents are siblings and
have the same node name) have the same structure and values. Since the four
extra nodes have different text values in this dataset, the percentage of
correct TEXT_NODEs in the new HTML document is slightly affected (86%). Lastly,
the differences between the original and the new HTML documents in the dataset
Chat Log are caused by the undiscovered mappings mentioned in the previous
paragraph. Because of this, it is not possible to fix the XSLT stylesheet.
However, the percentage of correct ATTRIBUTE_NODEs is still acceptable (75%).

4 http://xml.apache.org/xalan-j/index.html

5 http://www.componentsoftware.com/products/HTMLDiff/

We have tested XSLTGen on many other examples and the results are very
similar to those obtained in this experiment. However, there are some problems
that prevent XSLTGen from obtaining even higher matching accuracy. First, in
a few cases, XSLTGen is not able to discover some mappings between XML
ATTRIBUTE_NODEs and HTML ATTRIBUTE_NODEs because these mappings violate
our assumption stated in Section 2.2. This problem can be alleviated by considering
HTML ATTRIBUTE_NODEs in the matching process. Undiscovered mappings
are also caused by incorrectly matching some nodes, which is the second problem
faced in the matching process. Incorrect matchings typically occur when
an XML or an HTML TEXT_NODE has some ELEMENT_NODE siblings. In some
cases, these nodes should be matched during the text matching process, while in
other cases they should be matched in structure matching. Here, the challenge
will be in developing matching techniques that are able to determine whether
a TEXT_NODE should be matched during text matching or structure matching.
The third problem concerns incorrectly classified mappings. This problem
only occurs between a substring mapping and an exact mapping, when the compared
strings have some leading and trailing whitespaces. Determining whether
whitespaces should be kept or removed is a difficult choice.

Besides this, as the theme of our text matching subsystem is text-based 
matching (matching two strings), the performance of the matching process de- 
creases if the supplied documents contain mainly numerical data. In this case, 
the mappings discovered, especially substring mappings, are often inaccurate 
and conflicting, i.e. more than one HTML node is matched with an XML node. 

Finally, the current version of XSLTGen does not support the capability to 
automatically generate XSLT stylesheets with complex functions (e.g. sorting). 
This is a very challenging task and an interesting direction for future work. 



4 Related Work 

There is little work in the literature about automatic XSLT stylesheet generation.
The only prior work of which we are aware is XSLbyDemo [10], a system
that generates an XSLT stylesheet by example. In this system, the process of
generating an XSLT stylesheet begins with transforming the XML document to an
initial HTML page, which is an HTML page produced using a manually created XSLT
stylesheet, taking into account the DTD of the XML document. The user then
modifies the initial HTML page using a WYSIWYG editor and their actions are
recorded in an operation history. Based on the user’s operation history, a new
stylesheet is generated. This system is not automatic, since the user
is directly involved at some stages of the XSLT generation process. Hence, it is
not comparable to our fully automatic XSLTGen system. Specifically, our approach
differs from XSLbyDemo in three key ways: (i) our algorithm produces
a stylesheet that transforms an XML document to an HTML document, while
XSLbyDemo generates transformations from an initial HTML document to its
modified HTML document; (ii) our generated XSLT can be applied directly to
other XML documents from the same document class, whereas using XSLbyDemo,
the other XML documents have to be converted to their initial HTML
pages before the generated stylesheet can be applied; (iii) our users do
not have to be familiar with a WYSIWYG editor, nor do they need to provide
structural information through editing actions. The only thing that they
need to possess is knowledge of a basic HTML tool.

In the process of generating XSLT, semantic mappings need to be found. 
There are a number of algorithms available for tree matching. Work done in [12, 
13] on the tree distance problem or tree-to-tree correction problem and in [2] 
known as the change-detection algorithm, compare and discover the sequence of 
edit operations needed to transform the source tree into the result tree given. 
These algorithms are mainly based on structure matching, and their input
comprises two labelled trees of the same type, i.e. two HTML trees or two XML
trees. The text matching involved is very simple and limited since it compares 
only the labels of the trees. Clearly, these algorithms do not accommodate our 
needs, since we require an algorithm that matches an XML tree with an HTML 
tree. However, these algorithms are certainly useful in our refinement stage since 
within that subsystem, we are comparing two HTML documents. 

In the field of semantic mapping, a significant amount of work has focused 
on schema matching (refer to [11] for a survey). Schema matching is similar to
our matching problem in the sense that two different schemas are compared, 
which have different sets of element names and data instances. However, the two 
schemas being compared are mostly from the same domain and therefore, their 
element names are different but comparable. Besides using structure matching, 
most of the schema mapping systems rely on element name matchers to match 
schemas. The TransSCM system [9] matches schemas based on the structure and
names of the SGML tags extracted from DTD files, using the concept of labelled
graphs. The Artemis system [1] measures similarity of element names, data types
and structure to match schemas. In XSLTGen, it is impossible to compare the 
element names since XML and HTML have completely different tag names. 

XMapper [7] is another system for finding semantic mappings between structured
documents within a given domain, particularly XML sources. This system
uses an inductive machine learning approach to improve accuracy of mappings
for XML data sources whose data types are either identical or very similar, and
whose tag names are significantly different. In essence,
this system is suitable for our matching process in XSLTGen since the tag names
of XML and HTML documents are completely different. However, this system
requires the user to select one matching tag between two documents, which
violates our principal intention of creating a fully automatic system.

Recent work in the area of ontology matching also focuses on the problem 
of finding semantic mappings between two ontologies. One ontology matching 
system that we are aware of is the GLUE system [4]. GLUE also employs machine
learning techniques to semi-automatically create such semantic mappings. Given 
two ontologies: for each node in one ontology, the purpose is to find the most 
similar node in the other ontology using the notions of Similarity Measures and 
Relaxation Labelling. Similar to our matching process, the basis used in the 
similarity measure and relaxation labelling are data values and the structure 
of the ontologies, respectively. However, GLUE is only capable of finding 1-1
mappings whereas our XSLTGen matching process is able to discover not only 1-1
mappings but also 1-m and sometimes m-1 mappings (in substring mappings).

The main difference between mapping in XSLTGen and other mapping systems
is that in XSLTGen we believe that mappings exist between the elements
in the XML and HTML documents, since the HTML document is derived from
the XML document by the user, whereas in other systems the mapping may not
exist. Moreover, the mappings generated by the matching process in XSLTGen
are used to generate code (an XSLT stylesheet), which is why the mappings
found have to be accurate and complete, while in schema matching and
ontology matching the purpose is only to find the most similar nodes between
the two sources, without further processing of the results. To accommodate the
XSLT stylesheet generation, XSLTGen is capable of finding 1-1 mappings, 1-m
mappings and sometimes m-1 mappings, whereas the other mapping systems
focus only on discovering 1-1 mappings. Besides this, the matching subsystem
in XSLTGen has the advantage of having very similar and related data sources,
since the HTML data is derived from the XML data. Hence, the data instances can
be used as the primary basis to find the mappings. In other systems, the data
instances in the two sources are completely different; the only association
that they have is that the sources come from the same domain. Following this
argument, XSLTGen discovers the mappings between two different types of
document, i.e. an XML and an HTML document, whereas the other systems compare
two documents of the same type. Finally, another important aspect that
distinguishes XSLTGen from several other systems is that the process of
discovering the mappings, which are then used to generate the XSLT stylesheet,
is completely automatic.



5 Conclusion 

With the upsurge in data exchange and publishing on the Web, conversion of 
data from its stored representation (XML) to its publishing format (HTML) 
is increasingly important. XSLT plays a prominent role in transforming XML
documents into HTML documents. However, it is difficult for users to learn. 

We have devised XSLTGen, a system for automatically generating an XSLT
stylesheet, given an XML document and its corresponding HTML document.
This is useful for helping users to learn XSLT. The main strengths of the
generated XSLT stylesheets are accuracy and reusability. We have
described how text matching, structure matching and sequence checking
enable XSLTGen to discover not only 1-1 semantic mappings between the elements
in the XML document and those in the HTML document, but also 1-m mappings
and sometimes m-1 mappings. We have also described a fully automatic XSLT
generation system that generates XSLT rules based on the mappings found. Our
experiments showed that XSLTGen can achieve high matching accuracy and
produce high quality stylesheets.
Abstract. There is growing evidence that schema-conscious approaches
are a better option than schema-oblivious techniques as far as XML
query performance in a relational environment is concerned. However, the
issue of recursive XML queries for such approaches has not been dealt
with satisfactorily. In this paper we argue that it is possible to design
a schema-oblivious approach that outperforms schema-conscious approaches
for certain types of recursive queries. To that end, we propose
a novel schema-oblivious approach called Sucxent++ that outperforms
existing schema-oblivious approaches such as XParent by up to 15 times
and schema-conscious approaches (Shared-Inlining) by up to 3 times for
recursive query execution. Our approach has up to 2 times smaller storage
requirements compared to existing schema-oblivious approaches and
10% less than schema-conscious techniques. In addition, existing
schema-oblivious approaches are hampered by poor query plans generated by
the relational query optimizer. We propose optimizations in the XML
query to SQL translation process that generate queries with better
query plans.



1 Introduction 

Recursive XML queries are considered to be quite significant in the context of 
XML query processing [3] and yet this issue has not been addressed satisfactorily 
in existing literature. Recursive XML queries are XML queries that contain the 
descendant axis (//). The use of the ‘//’ is quite common in XML queries due 
to the semi-structured nature of XML data [3]. For example, consider the XML 
document in Figure 2. The element item could occur either under europe or 
africa. Consider the scenario where a user needs to retrieve all item elements. 
The user will have to execute the path expression Q = /site//item. Another 
scenario could be that the document structure is not completely known to the 
user except that each item has a name and price. Suppose the user needs to
find out the price of the item with name "Gold Ignot". Q = //item[name="Gold
Ignot"]/price will be the corresponding path expression.
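
As a small illustration of the two path expressions above, the following sketch evaluates them with Python's standard xml.etree.ElementTree on a hypothetical three-item document shaped like the sample DTD. ElementTree supports only a subset of XPath, so both expressions are written relative to the root element (`.//item` stands in for /site//item). The document content and variable names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical document following the sample DTD (values are illustrative).
DOC = """
<site>
  <regions>
    <europe>
      <item><name>Gold Ignot</name><price>$100</price></item>
      <item><name>Item1</name><price>$10</price></item>
    </europe>
    <africa>
      <item><name>Item2</name><price>$20</price></item>
    </africa>
  </regions>
</site>
"""

root = ET.fromstring(DOC)  # root is the <site> element

# /site//item: every item below site, whatever region it sits in.
items = root.findall(".//item")

# //item[name="Gold Ignot"]/price: price of the item with that name.
prices = [p.text for p in root.findall(".//item[name='Gold Ignot']/price")]

print(len(items), prices)
```

Neither expression names europe or africa, which is the point of the descendant axis: the query keeps working when the user does not know under which region an item occurs.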

Efficient execution of XML queries, recursive or otherwise, is largely
determined by the underlying storage approach. There has been a substantial
research effort in storing and processing XML data using existing relational
databases [1, 6, 2]. These approaches can be broadly classified as: (a)
Schema-conscious approach: This method first creates a relational schema based
on the DTD of the XML documents. An example of such an approach is the inlining
approach [5]. (b) Schema-oblivious approach: This method maintains a fixed
schema which is used to store XML documents irrespective of their DTD. Examples
of schema-oblivious approaches are the Edge approach [1], XRel [7] and
XParent [2]. Schema-oblivious approaches have obvious advantages, such as a
uniform query translation approach and the ability to handle XML schema changes
better, as there is no need to change the relational schema. Schema-conscious
approaches, on the other hand, have the advantage of more efficient query
processing [6]. Also, no special relational schema needs to be designed for
schema-conscious approaches as it can be generated on the fly based on the DTD
of the XML document(s).

In this paper, we present an efficient approach to process recursive XML 
queries using a schema-oblivious approach. At this point, one would question 
the justification of this work for two reasons. First, this issue may have already 
been addressed. Surprisingly, this is not the case as highlighted in [3]. Second, a 
growing body of work suggests that schema-conscious approaches perform bet- 
ter than schema-oblivious approaches. In fact, Tian et al. have demonstrated in 
[6] that schema-conscious approaches generally perform substantially better in 
terms of query processing and storage size. However, the Edge approach [1] was 
used as the representative schema-oblivious approach for comparison. Although 
the Edge approach is a pioneering relational approach, we argue that it is not a 
good representation of the schema-oblivious approach as far as query processing 
is concerned. In fact, XParent [2] and XRel [7] have been shown to outper- 
form the Edge approach by up to 20 times, with XParent outperforming XRel 
[2]. However, this does not mean that XParent outperforms schema-conscious 
approaches. In fact as we will show in Section 6, schema-conscious approaches 
still outperform XParent. Hence, it may seem that schema-conscious generally 
outperforms schema-oblivious in terms of query processing. In this paper we ar- 
gue that it is indeed possible to design a schema-oblivious approach that can 
outperform schema-conscious approaches for certain types of recursive queries. 

To justify our claim, we propose a novel schema-oblivious approach, called
Sucxent++ (Schema Unconscious XML Enabled System, pronounced
"succinct++"), and investigate the performance of recursive XML queries. We only
store the leaf nodes and the associated paths together with two additional
attributes for efficient query processing (details follow in Section 3).
Sucxent++ outperforms existing schema-oblivious techniques, such as XParent, by
up to 15 times and Shared-Inlining, a schema-conscious approach, by up to 3
times for recursive queries with characteristics described in Section 6. In
addition,
<!ELEMENT site (regions, ...)>
<!ELEMENT regions (africa, europe, ...)>
<!ELEMENT africa (item*)>
<!ELEMENT europe (item*)>
<!ELEMENT item (name, price, description)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT price (#PCDATA)>

[DTD graph: site, regions, africa, europe, item, name, price]

Fig. 1. Sample DTD.



[Tree of the sample XML document: the europe and africa elements are at
level 3; each item has name, price and description children, with leaf values
such as "Gold Ignot" ($100), "Item1" ($10), "Item2" and "Item3" ($30);
descriptions contain text, keyword, parlist and listitem elements with leaf
values such as "desc1" and "kwd1". Leaves are numbered in LeafOrder.]

Fig. 2. Sample XML document.



Sucxent++ can reconstruct shredded documents up to 2 times faster than
Shared-Inlining. The main reasons Sucxent++ performs better than existing
approaches are (1) significantly lower storage size and, consequently, lower
I/O cost associated with query processing, (2) fewer joins in the
corresponding SQL queries, and (3) additional optimizations, discussed in
Section 5, that are made to improve the query plan generated by the relational
query optimizer. In summary, the main contributions of this paper are: (1) A
novel schema-oblivious approach whose storage size depends only on the number
of leaf nodes in the document. (2) Optimizations to improve the query plan
generated by the relational query optimizer. Traditional schema-oblivious
approaches have been hampered by the poor query plan selection of the
underlying relational query optimizer [6, 8]. (3) To the best of our knowledge,
this is the first attempt to show that it is indeed possible to design a
schema-oblivious approach that can outperform schema-conscious approaches as
far as the execution of certain types of recursive XML queries is concerned.

2 Related Work 

All existing schema-oblivious approaches store, at the very least, every node 
in the XML document. The Edge approach [1] essentially captures edge in- 
formation of the tree that represents the XML document. However, resolving 
ancestor-descendant relationships requires the traversal of all the edges from 
the ancestor to the descendant (or vice-versa). The system proposed by Zhang
et al. in [8] labels each node with its preorder and postorder traversal
numbers. Then, ancestor-descendant relationships can be resolved in constant
time using the property preorder(ancestor) < preorder(descendant) and
postorder(ancestor) > postorder(descendant). This approach still results in as
many joins as there are path separators.
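
The containment property can be sketched in a few lines. The code below is an illustration only, not the implementation of [8]: it numbers the elements of a small hypothetical tree in preorder and postorder and then tests the ancestor-descendant relationship with two comparisons. The function names are assumptions.

```python
import xml.etree.ElementTree as ET

def label(root):
    """Return {element: (preorder, postorder)} for every element of the tree."""
    labels, pre, post = {}, 0, 0

    def visit(node):
        nonlocal pre, post
        pre += 1
        my_pre = pre            # number assigned when the node is entered
        for child in node:
            visit(child)
        post += 1               # number assigned when the node is left
        labels[node] = (my_pre, post)

    visit(root)
    return labels

def is_ancestor(labels, a, d):
    """True if a is a proper ancestor of d, using only the two labels."""
    return labels[a][0] < labels[d][0] and labels[a][1] > labels[d][1]

# Hypothetical tree: a has children b and d; b has child c.
tree = ET.fromstring("<a><b><c/></b><d/></a>")
b, d = tree[0], tree[1]
c = b[0]
labels = label(tree)
print(is_ancestor(labels, tree, c), is_ancestor(labels, b, d))
```

Each test is constant time, but in a relational setting both comparisons are inequality predicates, which is what leads to the join cost discussed next.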

To solve the problem of multiple joins, XRel [7] stores the path of each node 
in the document. Then, the resolution of path expressions only requires the paths 
(which can be represented as strings) to be matched using string matching
operators. However, the XRel approach still makes use of the containment
property mentioned above to resolve ancestor-descendant relationships. It
involves joins with θ (< or >) operators that have been shown to be quite
expensive due to the manner in which an RDBMS processes joins [8]. In fact,
special algorithms
such as the Multi-predicate merge sort join algorithm [8] have been proposed 
to optimize these operations. However, to the best of our knowledge there is no 
off-the-shelf RDBMS that implements these algorithms. 

XParent [2] solves the problem of θ-joins by using an Ancestor table that
stores all the ancestors of a particular node in a single table. It then
replaces θ-joins with equi-joins over this set of ancestors. However, this
approach results in an explosion in the database size as compared to the
original document. The number of relational joins is also quite substantial.
XParent requires a join between the LabelPath, DataPath, Element and Ancestor
tables for each path in the query expression. The joins are quite expensive,
especially when the Ancestor table is involved, as it can be quite large.

Sucxent++ is different from existing approaches in that it only stores leaf
nodes and their associated paths. We store two additional attributes, called 
BranchOrder and BranchOrderSum, for each leaf node that capture the relation- 
ship between leaf nodes. Essentially, they allow the determination of common 
nodes between the paths of any two leaf nodes in a constant time. This results 
in a substantial reduction in storage size and query processing time. In addition, 
we propose optimizations that enable the underlying relational query optimizer 
to generate near-optimal query plans for our approach, resulting in a substantial 
performance improvement. Our studies indicate that these optimizations can be 
applied to other schema-oblivious approaches as well. 
Fig. 3. XParent schema.

Fig. 4. Sucxent++ schema.

Fig. 5. Sucxent++ XML data in RDBMS.



Schema-oblivious approaches are not influenced by recursion in the schema. 
However, the Edge approach uses recursive SQL queries using the SQL99 with 
construct to evaluate recursive XML queries. XParent and XRel handle recursive 
queries like any other query. Unlike these schema-oblivious approaches, schema- 
conscious strategies have to treat recursion in both schema and queries as special 
cases. In [3], the authors propose a generic algorithm to translate recursive
XML queries for schema-conscious approaches using the SQL99 with construct.
However, no performance evaluation of the resulting SQL queries is presented
and it is assumed that schema-conscious approaches will outperform
schema-oblivious approaches. Sucxent++ also treats recursive XML queries like
other queries.
It also implements optimizations to generate SQL translations of recursive XML 
queries that enable the relational query optimizer to produce better query plans 
resulting in significant performance gains. 
3 Storing XML Data

In this section, we first discuss the Sucxent++ schema. This will be followed
by a formal algorithm to reconstruct XML documents from their relational form. 
The document in Figure 2 is used as a running example. 

3.1 SUCXENT++ Schema 

The schema is shown in Figure 4 and the shredded document in Figure 5. The
semantics of the schema is as follows. The Document table is used for storing
the names of the documents in the database. Each document has a unique id
recorded in DocID. Path is used to record the paths of all the leaf nodes. For
example, the path of the first leaf node name in Figure 2 is
/site/regions/europe/item/name. This table maintains path ids, relative path
expressions and their lengths, recorded as instances of PathID, PathExp and
Length respectively. This reduces the storage size, since we only need to store
the path id in the PathValue table. The Length attribute is useful for
resolving recursive queries.

PathValue stores only the leaf nodes. The DocID attribute indicates which
XML document a particular leaf node belongs to. The PathID attribute maintains
the id of the path of a particular leaf node as stored in Path. LeafOrder
records the node order of leaf nodes in an XML tree. For example, when the
sample XML document is parsed, the leaf node name with value "Gold Ignot"
is encountered as the first leaf node. Therefore, it is assigned a LeafOrder
value of 1. BranchOrder of a leaf node is the level at which it intersects the
preceding leaf node, i.e., it is the level of the highest-level (deepest)
common ancestor of the two leaf nodes. Consider the leaf node with LeafOrder=2
in Figure 2. This leaf node intersects the leaf node with LeafOrder=1 at the
node item, which is at level 4. So, the BranchOrder value for this node is 4.
Similarly, the node name with value "Item2" has BranchOrder=2 (intersecting
the node to the left at regions). PathValue stores the textual content of the
leaf nodes in the column LeafValue. The attribute BranchOrder in this table is
useful for reconstructing the XML documents from their shredded relational
format, as discussed in Section 3.2. The significance of DocumentRValue and
BranchOrderSum in PathValue is elaborated in Section 4, and CPathId in Path is
discussed in Section 5. For the remainder of the paper, we will refer to
LeafOrder and BranchOrder as Order information.
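
The Order information can be derived in a single document-order pass. The following is a minimal sketch under stated assumptions, not the authors' loader: it treats every element without child elements as a leaf, ignores attributes and mixed content, and gives the first leaf a BranchOrder of 0 (the text does not define it). The function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

def shred(xml_text):
    """Return one (LeafOrder, BranchOrder, path, value) row per leaf element."""
    root = ET.fromstring(xml_text)
    rows = []
    prev_chain = None  # root-to-leaf chain of elements of the previous leaf

    def visit(node, ancestors):
        nonlocal prev_chain
        chain = ancestors + [node]
        if len(node) == 0:          # no child elements: this is a leaf
            branch = 0              # assumption: first leaf gets BranchOrder 0
            if prev_chain is not None:
                for a, b in zip(prev_chain, chain):
                    if a is not b:
                        break
                    branch += 1     # one more level shared with the previous leaf
            path = "/" + "/".join(e.tag for e in chain)
            rows.append((len(rows) + 1, branch, path, (node.text or "").strip()))
            prev_chain = chain
        for child in node:
            visit(child, chain)

    visit(root, [])
    return rows

DOC = ("<site><regions><europe>"
       "<item><name>Gold Ignot</name><price>$100</price></item>"
       "<item><name>Item1</name><price>$10</price></item>"
       "</europe><africa>"
       "<item><name>Item2</name><price>$20</price></item>"
       "</africa></regions></site>")

rows = shred(DOC)
for row in rows:
    print(row)
```

On this input the second leaf (the first price) gets BranchOrder 4 and the name "Item2" gets BranchOrder 2, matching the two cases described in the text.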

3.2 Extraction of XML Documents 

The algorithm for reconstruction is presented in Figure 6. The input to the
algorithm is a list of leaf nodes arranged in ascending LeafOrder. Each leaf
node path is first split into its constituent nodes (lines 5 to 7). If the
document construction has not yet started (line 10) then the first node
obtained by splitting the first leaf node path is made the root (lines 11 to
15). When the next leaf node is processed we only need to look at the nodes
after BranchOrder of that node as the nodes up to this level have already been
added to the document
Input: L = {n_1, ..., n_k}, a list of leaf nodes arranged in order of
       LeafOrder values
Output: D is the document to be returned.

 1: c is an XML node.
 2: c ← φ
 3: C ← list of XML nodes.
 4: for all n_i in L do
 5:   /* /book/authors/author would give p = [book, authors, author] */
 6:   p is the array of nodes in a path.
 7:   p = n_i.Path.GetNodes()
 8:   /* s is a counter */
 9:   s ← 0
10:   if c = φ then
11:     /* c has not been assigned a value yet. */
12:     c ← new XmlDocumentNode(p[0])
13:     /* Make c the root. This happens only once. */
14:     D.AddNode(c)
15:     C.Add(c)
16:     s ← 1
17:   else s ← n_i.BranchOrder()
18:   end if
19:   /* Keep only those nodes in C that are
20:      common between n_{i-1} and n_i. */
21:   C.ClearFromIndex(s)
22:   q is an XML node
23:   /* assign c to a temporary variable q.
24:      Need to keep it as the starting node for processing n_{i+1} */
25:   q ← c
26:   while s < p.Length() do
27:     m ← new XmlDocumentNode(p[s])
28:     q.AppendChild(m)
29:     C.Add(m)
30:     q ← m
31:     s++
32:   end while

Fig. 6. Extraction algorithm.



(lines 20 to 22). The remaining nodes are now added to the document (lines 27 to 32).
Document extraction is completed once all the leaf nodes have been processed. 
In addition to reconstructing the whole document, this algorithm can be used 
to construct a document fragment given a partial list of consecutive leaf nodes. 
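
The reconstruction idea can be sketched in Python as follows. This is an illustrative reading of Figure 6 rather than a transcription: the sketch attaches new nodes to the last node retained in the list C (called `stack` here), which is an assumption about the intended starting node, and it also writes each leaf value into the leaf it creates. The row layout and names are hypothetical.

```python
import xml.etree.ElementTree as ET

def rebuild(rows):
    """Rebuild a document from (BranchOrder, path, value) rows in LeafOrder."""
    root = None
    stack = []  # nodes on the path of the previous leaf (the list C)
    for branch, path, value in rows:
        tags = path.strip("/").split("/")
        if root is None:
            root = ET.Element(tags[0])  # first path component becomes the root
            stack = [root]
            keep = 1
        else:
            keep = branch               # levels shared with the previous leaf
        del stack[keep:]                # keep only the common nodes
        parent = stack[-1]
        for tag in tags[keep:]:         # add the nodes below the branch level
            node = ET.SubElement(parent, tag)
            stack.append(node)
            parent = node
        parent.text = value             # parent is now the leaf itself
    return root

ROWS = [
    (0, "/site/regions/europe/item/name", "Gold Ignot"),
    (4, "/site/regions/europe/item/price", "$100"),
    (3, "/site/regions/europe/item/name", "Item1"),
    (4, "/site/regions/europe/item/price", "$10"),
    (2, "/site/regions/africa/item/name", "Item2"),
    (4, "/site/regions/africa/item/price", "$20"),
]

doc = rebuild(ROWS)
print(ET.tostring(doc, encoding="unicode"))
```

Because each row only needs the nodes retained from its predecessor, the same loop works for a consecutive run of leaves once the first row of the run is treated as the start of the fragment.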

4 Recursive Query Processing 

Consider the recursive query XQuery 1 in Figure 7. A tree representation of
the query is shown in Figure 8. This query returns those price leaf nodes that
intersect the constraint-satisfying text leaf node at item. Consider how
XParent resolves this query. The schema for XParent is shown in Figure 3.
XParent evaluates this query by locating leaf nodes from the Data table that
satisfy the constraint on text. This involves a join between the LabelPath and
Data tables to satisfy the path constraint /site/regions/africa/item//text and
a predicate on Data to satisfy the value constraint. Next, the LabelPath and
Data tables are joined again to obtain those leaf nodes that satisfy
/site/regions/africa/item/price. These two result sets are joined using the
Ancestor table to find nodes that have a common ancestor at level 4 (at item).
Thus, the final SQL query involves five joins: two between LabelPath and Data,
two between Data and Ancestor, and one between two Ancestor tables (SQL query
translation details for XParent can be found in [2]). These joins can be quite
expensive due to the large size of Ancestor. XRel follows a similar approach to
 5  WHERE p1.PathExp LIKE '/site/regions/africa/item/%/text'
 6  AND p2.PathExp = '/site/regions/africa/item/price'
 7  AND v1.PathId = p1.PathId and v2.PathId = p2.PathId
 8  AND v1.LeafValue LIKE '%Gold Ignot%'
 9  AND v1.DocId = v2.DocId and r1.Level = 4
10  AND abs(v1.BranchOrderSum - v2.BranchOrderSum) < r1.RValue

Fig. 7. Running example.

[Query tree: site, regions, africa, with the constraint Contains("Gold Ignot")]

Fig. 8. Query Tree.
resolving path expressions except that it uses the ancestor-descendant
containment property instead of an Ancestor table. This produces θ-joins,
resulting in
performance worse than XParent. A detailed evaluation of XRel vs. XParent can 
be found in [2]. 

4.1 The Sucxent++ Approach

In order to reduce the I/O cost involved in query evaluation, Sucxent++ only
stores the leaf nodes of a document. However, the attributes discussed till now
are insufficient for query processing. The schema needs to be extended as
follows. An attribute BranchOrderSum, denoted as $s_n$, is assigned to a leaf
node with LeafOrder $n$. In addition, we store an attribute RValue, $r_l$, in
the DocumentRValue table for each level, $l$, in the document. Essentially,
these allow the determination of common nodes between the paths of any two leaf
nodes in constant time. This results in a substantial reduction in storage size
and query processing time. Given an XML document with maximum depth $D$, the
RValue and BranchOrderSum assignment is done as follows. (1) RValue is assigned
recursively based on the equation $r_l = r_{l+1} \times c_{l+1} + 1$, where (a)
$c_k$ is the maximum number of consecutive leaf nodes with BranchOrder
$\geq k$ and (b) $r_D = 1$. (2) Let us denote the BranchOrder of a node with
LeafOrder $n$ as $b_n$. Then, the BranchOrderSum of this node is
$s_n = \sum_{i=2}^{n} r_{b_i}$, so that $s_1 = 0$.

We illustrate the above attributes with an example. Consider the document
in Figure 2. For simplicity, ignore the parlist element. Then, the depth of the
document is 6. So, $r_6 = 1$ and $c_6 = 1$. This means that
$r_5 = 1 \times 1 + 1 = 2$. The maximum number of consecutive leaf nodes with
BranchOrder $\geq 5$ is 1. Therefore, $r_4 = 2 \times 1 + 1 = 3$. The maximum
number of consecutive leaf nodes with BranchOrder $\geq 4$ is 3 (e.g., price,
text, keyword under the first item element). So,
$r_3 = 3 \times 3 + 1 = 10$. BranchOrderSum of the first leaf node is 0. Since
BranchOrder of the second leaf node is 4 and $r_4 = 3$, BranchOrderSum of the
second leaf node is 3. The values for the complete document are shown in
DocumentRValue and PathValue of Figure 5.
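
The assignment can be sketched as two short functions. This is an illustration under assumptions, not the authors' code: a count of zero consecutive leaves is replaced by 1 (the example sets the count for the deepest level to 1, where an actual count would be 0), and the first leaf contributes nothing to the running sum. The input covers only the four leaves under the first item as described in the example (name, price, text, keyword), so the values for levels 1 and 2 are not those of the full document.

```python
def r_values(branch_orders, depth):
    """Return {level: RValue} using r_l = r_(l+1) * c_(l+1) + 1 and r_depth = 1."""
    def c(k):
        # Longest run of consecutive leaves whose BranchOrder is at least k.
        best = run = 0
        for b in branch_orders:
            run = run + 1 if b >= k else 0
            best = max(best, run)
        return max(1, best)  # assumption: an empty run counts as 1

    r = {depth: 1}
    for level in range(depth - 1, 0, -1):
        r[level] = r[level + 1] * c(level + 1) + 1
    return r

def branch_order_sums(branch_orders, r):
    """Running sum of r[BranchOrder], starting at 0 for the first leaf."""
    sums, total = [], 0
    for i, b in enumerate(branch_orders):
        if i > 0:
            total += r[b]
        sums.append(total)
    return sums

# BranchOrders of name, price, text, keyword under the first item (depth 6).
BRANCH = [0, 4, 4, 5]
r = r_values(BRANCH, 6)
sums = branch_order_sums(BRANCH, r)
print(r[5], r[4], r[3], sums)
```

For these four leaves the sketch yields the RValues 2, 3 and 10 for levels 5, 4 and 3, and the BranchOrderSum values 0, 3 and 6 for name, price and text, in line with the worked example.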

Lemma 1. If $s_d = |s_n - s_m| < r_l$ then nodes with LeafOrders $n$ and $m$
intersect at a level greater than $l$. That is,
$|s_n - s_m| < r_l \Rightarrow I(n, m) > l$,
where $I(n, m)$ is the level at which nodes with leaf orders $n$ and $m$
intersect.

The proof for the above lemma is not presented here due to space constraints. 
The attributes RValue and BranchOrderSum allow the determination of the inter- 
section level between any two leaf nodes in a more or less constant time, whereas 
in XParent, it depends on the size of the Ancestor and Data tables as a join 
between these tables is required to determine the ancestor node at a particular 
level. This reduces the query processing time drastically. Since this is achieved 
without storing separate ancestor information, the storage requirements are also 
reduced significantly. 

We will now discuss how these attributes are useful in query processing.
Consider XQuery 1. The BranchOrderSum value for the first constraint-satisfying
text is 6. The BranchOrderSum value for the first price node is 3. Also,
$r_3 = 10$. Using the property stated above, we conclude that these two nodes
have common ancestors down to a level > 3, since $|3 - 6| < 10$. Since item is
at level 4 in both cases, it is clear that they have a common item node and,
therefore, satisfy the query. Similarly, we can conclude that the first text
node and the item node with name Item3 intersect at a level > 1 (since
$r_1 = 329$ and $|85 - 3| < 329$), but the same difference is not below
$r_3 = 10$, so they do not form a part of the query result.
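
Lemma 1 can be checked exhaustively on a small case. The sketch below hard-codes the BranchOrder, RValue and BranchOrderSum values of a hypothetical six-leaf document of depth 5 (not the document of Figure 2), computes the true intersection level of every pair of leaves as the smallest BranchOrder met between them, and verifies the implication for every level. All names and values are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical six-leaf document of depth 5 (two items under europe, one under africa).
branch = [0, 4, 3, 4, 2, 4]            # BranchOrder per leaf, in LeafOrder
r = {1: 51, 2: 10, 3: 3, 4: 2, 5: 1}   # RValue per level
s = [0, 2, 5, 7, 17, 19]               # BranchOrderSum per leaf

def intersect_level(n, m):
    """Level at which leaves n < m (0-based) intersect: the smallest
    BranchOrder met while walking from leaf n+1 to leaf m."""
    return min(branch[n + 1 : m + 1])

def lemma_holds():
    """Check |s_n - s_m| < r_l  =>  I(n, m) > l for all pairs and levels."""
    for n, m in combinations(range(len(s)), 2):
        for level, rvalue in r.items():
            if abs(s[n] - s[m]) < rvalue and not intersect_level(n, m) > level:
                return False
    return True

print(lemma_holds())
```

The practical consequence is that a query never has to walk ancestor lists: one subtraction and one comparison against a per-level constant decide whether two leaves share an ancestor below a given level.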

4.2 SQL Translation 

We have implemented an algorithm to translate XQuery queries to SQL in
Sucxent++. Due to space constraints we discuss the translation procedure
informally. Consider the recursive query of Figure 7 (XQuery 1) and its
corresponding SQL translation (SQL 1). The translation can be explained as
follows: (1) Lines 5, 7 and 8 translate the part of the query that seeks an
entry with contains(text, "Gold Ignot"). Note that we store only the leaf
nodes, their textual content and path id in the PathValue table. The actual
path expression corresponding to the leaf node is stored in the Path table.
Therefore, we need to join the two to obtain leaf nodes that correspond to
the path /site/regions/africa/item//text and contain the phrase "Gold
Ignot". Notice that the corresponding SQL translation has the LIKE clause to
resolve the // relationship. This is how recursive queries are handled in
Sucxent++. (2) Lines 6 and 7 do the same for the extraction of leaf nodes that
correspond to the path /site/regions/africa/item/price. (3) Line 9 ensures
that the leaf nodes extracted in Lines 5 to 8 belong to the same document. (4)
Line 10 ensures that the two sets of leaf nodes intersect at least at level 4.
The reason a level 4 ancestor is needed is that the two paths in the query
intersect at level 4. It calculates the absolute value of the difference
between the BranchOrderSum values and ensures that it is below the RValue for
level 4. (5) Line 1 returns the properties of the leaf nodes corresponding to
the price element. These properties are needed to construct the corresponding
XML fragment based on the algorithm in Figure 6. Say the return clause in
Figure 7 was <item>$b</item>. Then, line 6 in the translation would change to
p2.PathExp LIKE '/site/regions/africa/item%' to extract all leaf nodes that
have paths beginning with $b. This way, elements and their children can be
retrieved.
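
To make the translation concrete, the sketch below runs a query of this shape against an in-memory SQLite database from Python. Several things are assumptions: the SELECT and FROM clauses are a guess at the parts of SQL 1 not reproduced here, the table contents describe a hypothetical two-item document, and the RValue test uses level 3, following Lemma 1 as stated (a difference below $r_3$ implies intersection at a level above 3) rather than the literal r1.Level = 4, since the level indexing of DocumentRValue in Figure 5 is not visible here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Path (PathId INTEGER, PathExp TEXT);
CREATE TABLE PathValue (DocId INTEGER, PathId INTEGER, LeafOrder INTEGER,
                        BranchOrderSum INTEGER, LeafValue TEXT);
CREATE TABLE DocumentRValue (DocId INTEGER, Level INTEGER, RValue INTEGER);

INSERT INTO Path VALUES
  (1, '/site/regions/africa/item/description/text'),
  (2, '/site/regions/africa/item/name'),
  (3, '/site/regions/africa/item/price');

-- Two hypothetical items; sums computed with r_4 = 3 and r_3 = 7.
INSERT INTO PathValue VALUES
  (1, 2, 1,  0, 'Gold Ignot'),
  (1, 3, 2,  3, '$100'),
  (1, 1, 3,  6, 'pure Gold Ignot bar'),
  (1, 2, 4, 13, 'Item1'),
  (1, 3, 5, 16, '$10'),
  (1, 1, 6, 19, 'plain');

INSERT INTO DocumentRValue VALUES (1, 3, 7), (1, 4, 3);
""")

rows = con.execute("""
SELECT v2.LeafOrder, v2.LeafValue
FROM PathValue v1, PathValue v2, Path p1, Path p2, DocumentRValue r1
WHERE p1.PathExp LIKE '/site/regions/africa/item/%/text'
  AND p2.PathExp = '/site/regions/africa/item/price'
  AND v1.PathId = p1.PathId AND v2.PathId = p2.PathId
  AND v1.LeafValue LIKE '%Gold Ignot%'
  AND v1.DocId = v2.DocId AND r1.DocId = v1.DocId
  AND r1.Level = 3   -- assumption: level taken from Lemma 1, see lead-in
  AND abs(v1.BranchOrderSum - v2.BranchOrderSum) < r1.RValue
""").fetchall()

print(rows)
```

Only the price of the item whose description text mentions the phrase is returned; the second price differs from the matching text leaf by 10, which is not below the RValue of 7, so it is filtered out without any ancestor lookup.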

Compared to XParent, Sucxent++ uses only the PathValue, Path and
DocumentRValue tables to evaluate a query. The size of the PathValue and Path
tables is the same as that of the Data and LabelPath tables in XParent.
DocumentRValue has as many rows as the depth of the document, whereas the
Ancestor table in XParent stores the ancestor list of every node in the
document. This results in substantially better query performance in addition
to much smaller storage size.

5 Optimizations 

A preliminary performance evaluation using the above translation procedure 
yielded some interesting results. We checked the query plans generated by the 
Fig. 9. Initial query plan.

Fig. 10. Path optimization.

Fig. 11. Multiple-queries optimization.



query optimizer and noticed that the join between the Path and PathValue tables
took a significant portion of the query processing time. This was because for
most of the queries this join was being performed last. For example, in SQL 1 of
Figure 7 the joins in lines 8 to 10 were evaluated first and only then was the join
between Path and PathValue tables performed. The initial query plan is shown in
Figure 9. We have not shown the DocumentRValue table in the plan, even though 
the query optimizer includes it, as it does not influence the optimization. The 
two Hash-Joins (labelled 1 and 2) in this plan are both very expensive. The 
first takes the PathValue table (with alias v2) as one of its inputs. The second 
join takes the result of this join as one of its inputs. Both these inputs are quite 
substantial in size resulting in very expensive join operations. In order to improve 
the above query plans we propose three optimizations that are discussed below. 

Optimization for Simple Path Expressions. The join expression v1.PathId
= p1.Id and p1.PathExp = path is replaced with v1.PathId = n where n is
the PathId value corresponding to path in the table Path. Similarly, v1.PathId
= p1.Id and p1.PathExp LIKE path% is replaced with v1.PathId >= n and
v1.PathId <= m. For the second case PathIds are assigned in lexicographic order
Fig. 12. DTD graph and path numbering.

SQL 1.1
1  SELECT
2  v1.* into tmp1 from PathValue v1, Path p1
3  WHERE p1.PathExp LIKE '/site/regions/africa/item/%/text'
4  AND v1.PathId = p1.PathId
5  AND v1.LeafValue LIKE '%Gold Ignot%'

SQL 1.2
1  SELECT
2  t2.* from tmp1 t1, tmp2 t2, DocumentRValue r1
3  WHERE t1.DocId = t2.DocId and t1.DocId = r1.DocId
4  AND r1.Level = 4 AND
5  abs(t1.BranchOrderSum - t2.BranchOrderSum) < r1.RValue

Fig. 13. Multiple queries.



and (n, m) correspond to the first and last occurrences of expressions that have
the prefix path. This changes the query plan to the one in Figure 10. Since there
is no join between the PathValue and Path tables anymore, the joins in Lines
9 and 10 are now executed last. The PathId and LeafValue predicates
are evaluated earlier, resulting in smaller inputs to the join operations. This
optimization resulted in an improvement of up to 60% in query execution time
as shown in Section 6.
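
The effect of lexicographic numbering can be sketched as follows: once the distinct paths are sorted, all paths sharing a prefix are adjacent, so a prefix match turns into a PathId range. The helper names and the path list are hypothetical; the sketch only illustrates the numbering idea, not the authors' translator.

```python
from bisect import bisect_left

def number_paths(paths):
    """Assign PathIds 1..N to the distinct paths in lexicographic order."""
    ordered = sorted(set(paths))
    return ordered, {p: i + 1 for i, p in enumerate(ordered)}

def prefix_range(ordered, prefix):
    """Return (n, m), the first and last PathId whose path starts with prefix,
    or None if no path has that prefix."""
    lo = bisect_left(ordered, prefix)   # index of the first path >= prefix
    hi = lo
    while hi < len(ordered) and ordered[hi].startswith(prefix):
        hi += 1
    return (lo + 1, hi) if hi > lo else None

PATHS = [
    "/site/regions/africa/item/name",
    "/site/regions/africa/item/price",
    "/site/regions/africa/item/description/text",
    "/site/regions/europe/item/name",
    "/site/regions/europe/item/price",
]

ordered, path_ids = number_paths(PATHS)
africa = prefix_range(ordered, "/site/regions/africa/item")
print(path_ids["/site/regions/africa/item/price"], africa)
```

With this numbering, a predicate such as PathExp LIKE '/site/regions/africa/item%' can be rewritten as PathId >= 1 and PathId <= 3, and an equality on a full path becomes a single PathId comparison, so the Path table no longer takes part in the join.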

Optimization for Recursive Path Expressions. A lexicographic numbering
of paths is not sufficient for recursive expressions when the DTD structure is a
graph. Figure 12 shows an example of such a DTD. It has a graph structure
due to the recursion on the section element. If only the lexicographic PathId
is available, expressions such as //title cannot be optimized, i.e., converted
to a range expression instead of a join. We assign another path id, called
CPathId, to a Path based on the following rules: (1) Elements in the DTD graph
are ordered by the number of incoming edges. Lexicographic ordering is followed
within this ordering. Figure 12 shows the "reordered" graph. The element title
is ordered first as it has the highest number of incoming edges. 1 ... n are
the CPathId values for paths ending in title. (2) Cycles in the DTD graph are
handled by clustering paths with the same non-recursive element after the end
of the cycle.



504 Sandeep Prakash, Sourav S. Bhowmick, and Sanjay Madria 



Based on this rule, /book/section/title, /book/section/section/title, ...,
/book/section/.../section/title would all occur consecutively for the DTD
in Figure 12. This allows the replacement of paths such as //section//title
with range expressions in the SQL translation.

The SUCXENT++ schema has to be extended to incorporate the CPathId 
attribute together with the existing PathId column in PathValue (Figure 5). 
Any recursive path expression can now be converted to a range query on 
the CPathId attribute. Consider the following examples: (1) //title is replaced 
by (p.CPathId >= 1 and p.CPathId <= n) as all paths ending in title have 
CPathId values between 1 and n. (2) Consider the path //section/title. To 
begin with, the first and last CPathId values of //section/title in the Path table 
are obtained. Say these are n_f and n_l, respectively. Then, the join expression is 
replaced by (p.CPathId >= n_f and p.CPathId <= n_l). 
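
The effect of a suffix-clustered numbering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: instead of the two CPathId rules described above, the sketch orders paths by their reversed step sequence, a simpler stand-in that also places all paths sharing a trailing step sequence next to each other. The function names and the sample paths are hypothetical and do not come from the SUCXENT++ implementation.

```python
def steps(path):
    """Split '/book/section/title' into ['book', 'section', 'title']."""
    return path.strip("/").split("/")


def assign_cpath_ids(paths):
    """Number paths so that paths with a common suffix get consecutive ids.

    Assumption: sorting by the reversed step list stands in for the
    paper's CPathId rules; it is not the authors' exact ordering.
    """
    ordered = sorted(set(paths), key=lambda p: steps(p)[::-1])
    return {p: i for i, p in enumerate(ordered, start=1)}


def suffix_range(cpath_ids, suffix):
    """Return (first, last) id of all paths ending in `suffix`, or None."""
    want = steps(suffix)
    ids = sorted(i for p, i in cpath_ids.items()
                 if steps(p)[-len(want):] == want)
    return (ids[0], ids[-1]) if ids else None


def range_predicate(cpath_ids, suffix):
    """Render the range condition that replaces the join for //suffix."""
    bounds = suffix_range(cpath_ids, suffix)
    if bounds is None:
        return "(1 = 0)"  # no stored path matches the expression
    return f"(p.CPathId >= {bounds[0]} and p.CPathId <= {bounds[1]})"


# Hypothetical paths for a recursive book/section DTD.
PATHS = [
    "/book/title",
    "/book/section/title",
    "/book/section/section/title",
    "/book/section",
    "/book/section/section",
    "/book/author",
]
IDS = assign_cpath_ids(PATHS)
```

With these sample paths, `range_predicate(IDS, "title")` yields `(p.CPathId >= 4 and p.CPathId <= 6)` and `range_predicate(IDS, "section/title")` yields `(p.CPathId >= 5 and p.CPathId <= 6)`, so both recursive expressions become a single range condition on one column.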

Optimization Using Multiple Queries. After performing the above two 
optimizations the new query plans still had one major limitation. The last two 
join expressions (lines 9 and 10 in Figure 7) were still being evaluated using 
Hash-Joins. The analysis of the two intermediate results used for the evaluation 
of the join expression found that Nested-Loop would be a better option. 

Forcing a Nested-Loop-based query plan is not a good choice, as there are 
cases where Hash-Join (or Merge-Sort join) is still a better option. Our conclusion 
was that we should separate the pre-join results, execute a separate join query on 
these temporary results and let the query optimizer decide. We materialized one 
of the results into a separate temporary table and then executed a join on this 
temporary table and PathValue. The query optimizer now generated a better plan 
for all queries. This optimization resulted in an improvement of up to 7 times 
as shown in Section 6. The final set of queries for the given example, in order 
of execution, is shown in Figure 13. SQL 1.1 corresponds to the intermediate 
result. The resulting query plan is shown in Figure 11. 
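
The two-step evaluation can be illustrated with a minimal sketch that uses Python's built-in sqlite3 module. Everything here is an assumption made for illustration: the two-table schema is a heavily simplified stand-in for the Path and PathValue tables, the join condition only requires that both leaves belong to the same document, and SQLite replaces the unnamed commercial RDBMS. The point is only the shape of the technique: materialize the filtered intermediate result first, then issue a separate join and leave the join method to the optimizer.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Path (PathId INTEGER PRIMARY KEY, PathExp TEXT);
        CREATE TABLE PathValue (DocId INTEGER, PathId INTEGER,
                                LeafOrder INTEGER, LeafValue TEXT);
        INSERT INTO Path VALUES (1, '/topic/Title'), (2, '/topic/Description');
        INSERT INTO PathValue VALUES
            (1, 1, 1, 'Photography'), (1, 2, 2, 'Photo sites'),
            (2, 1, 1, 'Cooking'),     (2, 2, 2, 'Recipes');
    """)
    return conn


def query_in_two_steps(conn, where_path, where_value, return_path):
    """Evaluate a where/return pair as two separate SQL statements."""
    cur = conn.cursor()
    # Step 1: materialize the leaves that satisfy the where clause.
    cur.execute("DROP TABLE IF EXISTS temp.Matches")
    cur.execute("CREATE TEMP TABLE Matches (DocId INTEGER, LeafOrder INTEGER)")
    cur.execute(
        """INSERT INTO Matches
           SELECT v.DocId, v.LeafOrder
           FROM PathValue v JOIN Path p ON p.PathId = v.PathId
           WHERE p.PathExp = ? AND v.LeafValue = ?""",
        (where_path, where_value),
    )
    # Step 2: join the small temporary table with PathValue; the optimizer
    # now sees the real size of the intermediate result.
    cur.execute(
        """SELECT v.LeafValue
           FROM Matches m
           JOIN PathValue v ON v.DocId = m.DocId
           JOIN Path p ON p.PathId = v.PathId
           WHERE p.PathExp = ?
           ORDER BY v.DocId, v.LeafOrder""",
        (return_path,),
    )
    return [row[0] for row in cur.fetchall()]
```

For example, `query_in_two_steps(build_demo_db(), "/topic/Title", "Photography", "/topic/Description")` returns the description stored for the matching topic.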



6 Performance Evaluation 

SUCXENT++ was developed using Java JDK 1.5 and a commercial RDBMS¹. 
The experiments were conducted on a P4 1.4GHz machine with 256MB of RAM 
and a 40GB (7200rpm) IDE hard disk. The operating system was Windows 2000 
Professional. We experimented with the data sets shown in Figure 17 and the 
queries shown in Figure 14 which also indicates the sources of these data sets. 
Note that the DTD graph of the ODP dataset contains cycles. Also, in order to 
measure the insertion/extraction times we use a small subset of the DBLP data 
set with documents that vary in size from 11KB to 1MB. 

Storage Size. Figure 17 shows the relative database sizes for the three ap- 
proaches. Note that, as expected, XParent has by far the largest database size 
¹ Our licensing agreement disallows us from naming the product. 

Q1:  FOR $b in document("odp.xml")//topic 
     WHERE $b/Title = "Photography" 
     RETURN $b/Description 
Q2:  FOR $b in document("odp.xml")//topic 
     WHERE $b//Title = "Photography" 
     RETURN $b/Description 
Q3:  FOR $b in document("odp.xml")//topic 
     WHERE month($b/lastUpdate) >= 10 
     RETURN $b/Description 
Q4:  FOR $b in document("odp.xml")//topic 
     WHERE month($b//lastUpdate) >= 10 
     RETURN $b/Description 
Q5:  FOR $b in document("auction.xml")/site/regions 
     RETURN count($b//item) 
Q6:  ... 
     RETURN count($b//description) + count($b//annotation) + count($b//... 
Q7:  FOR $b in document("auction.xml")/site/regions/africa/item 
     WHERE contains($b//description, "gold") 
     RETURN $b/name 
Q8:  FOR $b in document("sprot.xml")/sptr/entry ... 
Q9:  FOR $b in document("sprot.xml")/sptr/entry 
     WHERE $b/reference//person[@name="Hermann R."] 
     RETURN $b/reference 
Q10: FOR $b in document("sprot.xml")/sptr/entry ... 

The Open Directory Project, http://dmoz.org. 
The Swiss-Prot Database, http://us.expasy.org 

Fig. 14. Queries and their features. 



among the three approaches and SUCXENT++ has the smallest. In the Shared- 
Inlining approach indexes are created on all columns to aid in query processing. 
We did notice that the non-indexed database size for the Shared-Inlining ap- 
proach was by far the smallest among the three. This means that indexing all 
columns in the inlining approach is not a good strategy as far as storage is 
concerned; index selection should instead be based on the query workload. 
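
As a rough aid for reading Figure 17, the stored size can be related to the raw document size. The short sketch below does this arithmetic on the figures reported there; it assumes that the three database-size columns are given in the same unit as the document size (MB), which the figure does not state explicitly, and the names used are purely illustrative.

```python
# (raw document size, SUCXENT++, Shared-Inlining, XParent), copied from Fig. 17.
# Assumption: the three database-size columns use the same unit as the raw size.
FIG17 = {
    "ODP": (142, 174, 178, 402),
    "XMark": (150, 202, 209, 477),
    "Swiss-Prot": (150, 211, 208, 453),
}


def expansion_factors(sizes):
    """Return stored size divided by raw size for each of the three approaches."""
    raw, *stored = sizes
    return tuple(round(s / raw, 2) for s in stored)


for name, sizes in FIG17.items():
    print(name, expansion_factors(sizes))
```

Under that assumption, XParent needs roughly three times the raw document size on all three data sets, while the other two approaches stay below one and a half times.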

Decomposition/Extraction Times. Figure 15 shows the results for docu- 
ment load performance, which is dependent on the number of tuples inserted. As 
expected, XParent takes the longest for inserting documents. The performance 
of SUCXENT++ and the inlining approach is quite comparable. 

Extraction time depends on the time taken to extract the relevant tuples 
and the main-memory processing time to reconstruct the document. The results in 
Figure 16 show that SUCXENT++ performs marginally better than XParent and 
up to 40% better than Shared-Inlining. The inlining approach has to join several 
tables to get all the data needed for document reconstruction. As an indication 
of the data fragmentation consider that 34 tables are created for the Swiss-Prot 
data set. In addition, the main-memory processing time is also higher due to the 
fragmented nature of the retrieved data. 

The extraction performance of SUCXENT++ is only slightly better than 
XParent. Even though the time taken to extract the relevant tuples (only leaf 
nodes) is smaller than the corresponding operation in XParent (that involves 

Fig. 15. Load performance. 



Fig. 16. Extraction performance. 



Data Set     Size (MB)   Node Count   Sucxent   Inlining   XParent 
ODP          142         2,884,074    174       178        402 
XMark        150         2,668,227    202       209        477 
Swiss-Prot   150         6,508,774    211       208        453 



Fig. 17. Storage size. 



retrieving all the nodes of the document), we still have to perform substring op- 
erations to determine the nodes in a path in order to create the document tree. 
In Step 7 of Figure 6 the process of obtaining the node array from the path is 
accomplished by the substring operation. This means that although retrieval time 
from the database is better in SUCXENT++, the time taken for reconstruction is 
higher. In fact, for smaller documents, when the number of tuples stored in the 
database is not very significant, SUCXENT++ performs worse than XParent. 

Query Performance. Our preliminary experiments showed that Shared- 
Inlining outperforms SUCXENT++ for query loads similar to the ones described 
in [6]. However, our experiments also show that recursive queries with certain 

Fig. 18. Query performance. 
Fig. 19. Variation with distance. 
Fig. 20. Variation with D − depth. 



characteristics perform better in SUCXENT++, especially with the optimizations 
discussed in Section 5. Query performance is partially influenced by the flexible 
manner in which the XQuery return clause can be specified. In particular, two 
factors affect performance. One is the distance between the where and return 
clause elements, defined as the number of edges with cardinality of 1 or more 
(*) between these elements in the DTD. For example, the distance between price 
and name under item in the DTD in Figure 1 is 0 as there is no edge between 
them with cardinality of 1 or more. Similarly, the distance between europe and 
price is 1. The distance corresponds to the number of joins in the SQL query 
as generated by the Shared-Inlining approach. As another example, consider the 
query in Figure 7 on the document in Figure 2. The distance between the return 
and where elements is greater than 1. The exact distance is not known as the 
schema is recursive and there could be any number of recursive text elements. 
For the Shared-Inlining approach this distance is the number of tables that need 
to be joined, thus affecting performance. The other factor is the depth of the 
element specified in the return clause. Shallow elements would require a greater 
number of joins in the Shared-Inlining approach as the descendants are likely to 
be fragmented across several tables. 
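
The distance measure can be made concrete with a small sketch. It rests on several assumptions: the DTD is represented as a hypothetical dictionary that maps each element to its children together with an occurrence indicator, the DTD fragment below is invented to mirror the price/name/europe example (Figure 1 itself is not reproduced here), "cardinality of 1 or more" is read as a repeatable edge marked * or +, and the DTD is assumed to be acyclic, so the recursive case mentioned above is not covered.

```python
def distance(dtd, root, a, b):
    """Count repeatable ('*' or '+') edges on the tree path between a and b.

    Assumption: `dtd` is an acyclic tree; recursive schemas are not handled.
    """
    def path_to(target):
        # Depth-first search that records the edges from the root to `target`.
        stack = [(root, [])]
        while stack:
            node, edges = stack.pop()
            if node == target:
                return edges
            for child, card in dtd.get(node, []):
                stack.append((child, edges + [(node, child, card)]))
        raise KeyError(target)

    pa, pb = path_to(a), path_to(b)
    # Drop the edges both paths share; what remains links a and b
    # through their lowest common ancestor.
    i = 0
    while i < min(len(pa), len(pb)) and pa[i] == pb[i]:
        i += 1
    return sum(1 for _, _, card in pa[i:] + pb[i:] if card in ("*", "+"))


# Hypothetical auction-style DTD fragment: element -> [(child, occurrence)].
DTD = {
    "site": [("regions", "1")],
    "regions": [("europe", "1")],
    "europe": [("item", "*")],
    "item": [("name", "1"), ("price", "1")],
}
```

With this fragment, `distance(DTD, "site", "price", "name")` is 0 and `distance(DTD, "site", "europe", "price")` is 1, matching the two examples in the text.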

Figure 18 shows the results for query performance. Figure 19 shows the vari- 
ation of query execution time with increasing distance between the where and 
return clauses in the XQuery query. Figure 20 shows the results as the depth of 
the return clause element is reduced (or, as (D − depth) is increased, where D 
is the maximum depth of the document). Notice that query performance results 
are shown on the log scale and the return clause results are shown as a ratio 
with respect to the baseline result. We first discuss query performance without 
optimizations. 

Query performance without optimizations. Shared-Inlining performs bet- 
ter than SUCXENT++ for queries Q1 and Q3. This is because the corresponding 
SQL query only involves the topic table and there is no need for any recursive 
SQL query as it is known that all topic elements are in the topic table and 
only their immediate title values need to be queried. SUCXENT++, on the 
other hand, has to execute a join query between the Path and PathValue tables. 
In addition, Q3 involves a typecast to date in SUCXENT++'s case as all data is 
stored as strings. Q2 and Q4 are quite similar to Q1 and Q3 except that the 
title element can be a descendant and not just a child. A recursive SQL query is 
generated for the Shared-Inlining approach using the technique mentioned in [3]. 
In SUCXENT++ we only need to ensure that the intersection level is equal to the 
length of the path minus one as the Description element is a child of the topic 
element in question. XParent also benefits from this approach and therefore per- 
forms better than Shared-Inlining. For Q5, Shared-Inlining involves a UNION of 
the joins between item and each of asia, namerica, samerica, europe, africa 
and australia. In SUCXENT++ this query merely looks for paths with the ex- 
pression /site/regions/*/item. However, the Shared-Inlining approach will 
perform much better if we use the knowledge that item is not used anywhere 
else in the document. Then, it would reduce to a count of the item table. This 
is highlighted by Q6 where the Shared-Inlining approach performs much better 
than SUCXENT++ or XParent (by 35 times). Here, the result is merely a sum of 
the tuples in description and annotation. This can be done because the paths 
are evaluated with respect to the root and it is implied that all description 
and annotation elements will be counted. 

SUCXENT++ performs better than Shared-Inlining for Q7 to Q9. This is be- 
cause the result that needs to be returned is in a different subtree. This leads 
to a greater number of joins in Shared-Inlining, whereas the number of joins 
remains unaltered for SUCXENT++. The difference is greater for Q9 (about 5 
times) than the other queries as it involves recursion, significant distance be- 
tween the return and where clause elements and a shallow return clause in 
the form of the reference element (whose descendants are spread across 4 ta- 
bles). However, SUCXENT++ performs worse for Q10. This is because of the 
poor query plan generated by the database and can be resolved by applying 
the optimizations discussed in Section 5. To summarize, SUCXENT++ outper- 
forms Shared-Inlining for 6 out of 10 queries by up to 5 times. Shared-Inlining 
outperforms SUCXENT++ for one particular query by 35 times. 

Query performance with optimizations. Notice that there is an improve- 
ment in most queries after the optimizations. Q1 to Q4, Q7 and Q10 show a more 
remarkable difference. For Q3 and Q4, this is partially due to the removal of the 
date typecast. When materializing the intermediate result we insert lastUpdate 
leaf nodes as date types. Also, all three optimizations are used for these queries. 
Q5 and Q6 only benefit from the first two optimizations. The intermediate re- 
sult sizes for Q8 and Q9 are not large enough to benefit from the optimizations. 
In fact, Q8 is adversely affected due to the overhead of the optimizations and 
performs worse. Q10, on the other hand, shows a significant performance im- 
provement and outperforms the Shared-Inlining approach. This is due to the better 
query plan generated as a result of using all three optimizations. 

Performance variation with distance. For this section we have used queries 
Q8 to Q9 for comparison, because they represent real-world sce- 
narios for queries with distant elements in the return and where clauses. Notice 
that SUCXENT++'s performance is independent of the distance between the 
where and return clause elements. This is expected as the number of join oper- 
ations remains unchanged. A performance change will be seen only if the number 
of elements that need to be returned changes. The performance of the Shared- 
Inlining approach, on the other hand, is affected considerably. In fact, for queries 
where the elements are in the same table, Shared-Inlining outperforms SUCX- 
ENT++ significantly. As the distance increases, Shared-Inlining performs worse 
due to the increase in the number of joins. The increase can be as much as 9 
times depending on the sizes of the tables involved in the joins. 

Performance variation with shallowness of return clause. This affects the 
query performance significantly, as shown in Figure 20. The results are plotted 
against decreasing depth (or increasing (D − depth), where D is the maximum 
depth of the document) of the return clause. Notice that the performance of 
SUCXENT++ is also affected adversely, by up to 8 times. This is because a greater 
number of leaf nodes need to be returned. However, there is no increase in the 
number of joins. Therefore, the performance degradation is not as severe as it is 
for Shared-Inlining, which can be affected by up to 17 times. 

7 Conclusions 

In this paper, we demonstrate that execution of certain types of recursive queries 
is more efficient using SUCXENT++ than using schema-conscious approaches. 
SUCXENT++ performs better for recursive queries that have 1) recursion in 
the schema, 2) a large distance between the elements of the where and return 
clauses, and 3) shallow return clause elements. Recursive schemas result in re- 
cursive SQL queries with the number of joins depending on the recursion level. 
For the latter two, the number of joins increases with distance and shallowness. To 
summarize, SUCXENT++ outperformed Shared-Inlining for 6 of the 10 queries 
we tested without optimizations. Once the optimizations were used it performed 
better for 7 queries. 

Abstract. Advanced personalized web applications require careful handling 
of their users’ wishes and preferences. Since such preferences do not always 
hold in general, personalized applications also have to consider the user’s cur- 
rent situation. In this paper we present a novel framework for modeling situa- 
tions and situated preferences. Our approach consists of a general meta model 
for situations, which can be applied as a foundation for situation models in a wide 
range of applications. Furthermore, an XML-based preference repository for the 
storage and management of situated preferences is developed. Long-term and 
situated preferences can easily be accessed with the preference repository inter- 
face. In particular, the preferences that best match a given situation can be queried. 
This approach allows web applications to react flexibly and in a personalized way to the 
changing situations of their users. 



1 Introduction 

The enormous growth of web content and web-based applications leads to an unsatis- 
factory experience for users: search engines retrieve a huge number of results for the 
user’s keywords, and he/she is left on his/her own to do the time-consuming task of 
finding interesting web sites or relevant products. Frequently customers who are will- 
ing to buy something cannot do so since they do not find the right product even if it is 
available. Such uncooperative query behavior leads not only to frustrated users but 
also to a reduction of turnover in commercial business. In recent years several person- 
alization techniques have been developed to better adapt web applications to users [24]. 
Advanced personalization requires careful handling of the user’s wishes and 
preferences [14]. Such preferences include explicit user preferences entered through a 
query interface, long-term preferences gained with preference mining [11, 12], or 
preferences given with a user feedback mechanism. All of these preferences should be 
managed intelligently in a preference repository. This preference repository plays a 
major role for the various components interoperating during the personalization proc- 
ess: personalized query composition [18] has to assemble the query using the various 
user preferences, recommendation technologies can help customers in finding inter- 
esting products by offering items similar customers have bought [3, 11], and personal- 
ized data dissemination and result presentation adapts the delivery and presentation of 
the query results to the user’s preferences [15, 16, 27]. Fig. 1 shows the architecture 
for deeply personalized applications using preference technologies. 

The components of this architecture can be applied in various application areas like 
personalized mobile services [27], intelligent e-procurement [15], or individualized 

Fig. 1. Preference-centered architecture for personalized applications 



financial services. As a fact of life the user preferences in such applications do not 
always hold in general but may depend on underlying situations. For instance, a cus- 
tomer may have different shopping preferences depending on his location (being at 
work or at home) or on the time of day [15]. In order to integrate such situations into 
preferences, a generic approach to situation modeling has to be developed. 

Initially let us consider some related work. So far in conceptual modeling and da- 
tabase technology only little research on situation models has been done. The prag- 
matic approach of computer scientists is usually focused on the needs of the underly- 
ing application. For instance, in [26] information like the user’s position, timestamp, 
or weather is used to define the user’s current situation. In [25] the so-called context- 
aware mobile computing distinguishes situations w.r.t. “where you are”, “who you are 
with”, and “what resources are nearby”. Other context-aware applications consider 
only selected aspects of situations like the timestamp or the user’s location [2, 5]. 

In cognitive science some work on situation modeling has been done in the past. 
Barwise and Perry developed the idea that a situation is composed of a collection of 
entities, whereby each entity may have a set of properties associated with it [1]. The 
entities of a situation can be any meaningful object, such as a person, an inanimate object, 
or an abstract idea. Furthermore, there can be some relations among these entities. These 
relations include, but are not limited to, spatial, temporal, or ownership relations [22]. 

With these ideas as foundation we develop an entity-relationship based meta model 
for situations in this work. Since its introduction to the database community in 1976 
by [4], the entity-relationship modeling technique has established itself as a major design 
tool for the development of database applications. The main advantages are the se- 
mantically rich nature and the easily interpretable structure of the resulting ER mod- 
els. With our extensions on situated ER models we follow this philosophy by provid- 
ing an intuitive framework for the integration of situations into the ER models of 
personalized applications. 

The rest of the paper is organized as follows: In Section 2 we describe our novel 
meta model for situations, give examples for situation models in personalized applica- 
tions, and introduce the concept of situated preferences. Section 3 describes the pref- 
erence repository as the major component for the storage and management of long- 
term and situated preferences. Situated preferences introduced in Section 2 are not 
restricted to a specific preference model, whereas in Section 3 we use a strict partial 
order approach for modeling preferences. We conclude our paper with a summary and 
outlook in Section 4. 

2 Situated Preferences 

Preferences and wishes are key components of personalized applications. In real life a 
user’s preferences are typically not static but vary with the situation. For 
instance, a user may have various news preferences depending on the temporal situa- 
tion: on Friday he is interested in the current exchange rates of his stocks and on the 
other days his preferred news categories are politics and science. In this section we 
introduce an appropriate approach to modeling such situations for personalized 
applications. 

2.1 The Meta Model for Situations 

The Oxford Advanced Learner’s Dictionary (http://www.oup.com/elt/oald) defines a 
situation as “all the circumstances and things that are happening at a particular time 
and in a particular place”. This definition as well as Barwise’s and Perry’s work in [1] 
denote location and time as important aspects of situations. Our meta model considers 
spatial-temporal entities as important, too, but also includes further - often applica- 
tion-specific - entities for describing situations. We define our meta model for situa- 
tions with entity-relationship modeling techniques. 

Fig. 2 represents the meta model of situation-oriented entities and relationships. 
Thereby the Situation is the most general entity type of situation models. It can con- 
tain any attributes describing the situational context of people, agents, applications, 
etc. Timestamp denotes the date and time of situations and the entity type Location 
can describe the current position. Attributes for Timestamp can be SQL data types like 
date, time, time zone, etc. It can also be described in more detail using temporal ER 
modeling techniques [9]. Attributes for the Location are, for example, city, zip-code, 
or global positioning system coordinates (GPS). Influences describes other aspects 
affecting a situation. Personal Influences denotes human factors of a situation like 
physical state or current emotion. Surrounding Influences describes outer influences 
like weather condition or other people the current user is together with. Each situation 
can consist of one timestamp and of one location but it can have one or more influ- 
ences (e.g. a personal and a surrounding influence). A timestamp, location, or influ- 
ence can be part of more than one situation. Personal Influences and Surrounding 
Influences are sub-entities of Influences. 

This framework for modeling situations can be integrated into existing ER-models 
that need to be enhanced with situational context. For instance, personalized applica- 
tions typically have an entity like user , which can be connected - with appropriate 
relationships - to entities describing the user’s situation. If required, the above meta 
model can be extended with further situated sub-entities. 
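
To make the entity types and their cardinalities more tangible, the meta model can be mirrored by a handful of Python dataclasses. This is a minimal sketch rather than part of the framework itself: the class layout, the attribute names, and the choice of strings for dates and coordinates are assumptions, and the sample values are borrowed from the situation instance shown in Fig. 5.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Timestamp:
    date: str
    time: str
    timezone: str


@dataclass(frozen=True)
class Location:
    city: str
    zip_code: str


@dataclass(frozen=True)
class Influence:
    """Base entity type for further aspects affecting a situation."""


@dataclass(frozen=True)
class PersonalInfluence(Influence):
    role: str  # human factors; a role is used as the example attribute


@dataclass(frozen=True)
class SurroundingInfluence(Influence):
    client_device: str  # outer influences; device and screen size as examples
    screen_size: str


@dataclass
class Situation:
    sid: int
    timestamp: Optional[Timestamp] = None  # at most one timestamp
    location: Optional[Location] = None    # at most one location
    influences: List[Influence] = field(default_factory=list)  # any number


# Timestamp, location and influence objects are immutable values, so the
# same object can be shared by more than one situation.
home = Situation(
    sid=212,
    timestamp=Timestamp("2003.01.12", "20:56:19", "GMT-1"),
    location=Location("Munich", "80469"),
    influences=[PersonalInfluence(role="home"),
                SurroundingInfluence("pda", "320 x 240")],
)
```

Application-specific sub-entities would be added as further subclasses of `Influence`, which corresponds to extending the meta model with situated sub-entities.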

2.2 Use Cases for Situation Models 

COSIMA is an online application providing electronic bargaining for computer hard- 
ware products [8]. During the bargaining process it is very important to notice the 
customer’s current situation - like being angry or pleased - and to react appropriately. 

Fig. 3 describes a situation model for the COSIMA application. For each customer 
the personal role is noted [23], since it makes a difference to COSIMA whether she is bar- 
gaining with an ordinary or a chief purchaser. The emotion of the customer is also 
important so that COSIMA can react to the current state of her counterpart. 

Fig. 2. Extensible meta model for situations 




Fig. 3. Situated ER-model for COSIMA 

The OCC emotion model introduced in [20] can be used as underlying domain. The 
product of the current bargaining process is the most important attribute of the sur- 
rounding influences, because it makes a great difference whether COSIMA is bargaining 
for a PC mouse or for a workstation. Additionally, the situation includes information 
about the current time and date and the location of the purchaser. A situation can be 
identified with a situation identifier sid. Application-specific attributes for the cus- 
tomer can be modeled with common standards (e.g. xCIL [19]). 

As a second case study we design a situation model for personalized web sites. As- 
sume a web portal provides personalized information like sports news, current stock 
quotations, weather forecast, etc. It would be a great added value for the users to get 
such information not only personalized but also with respect to the current situation. 

Fig. 4. Situated ER-model for personalized web sites 



Fig. 4 represents a potential situation model for such a situation-based personalized 
web portal. The role is used to decide whether the user is at home or at work [23]. The 
timestamp helps to present the right news categories at the right time. The spatial 
information can be used to deliver regional weather forecast or regional sports results. 
The user’s device and the corresponding screen size allow the adaptation of the web con- 
tents and the layout to the user’s current technical environment. For instance, the 
resolution should be reduced and the contents should be arranged appropriately, if he 
uses a mobile phone. Browsing device and screen size can be specified with the 
CC/PP framework defined by the W3C consortium [17]. An instance of such an ER 
model defines a concrete situation (Fig. 5). 



situation.sid = 212 
timestamp.date = 2003.01.12 
timestamp.time = 20:56:19 
timestamp.timezone = GMT-1 
location.city = Munich 
location.zip = 80469 
personal_influences.role = home 
surrounding_influences.client_device = pda 
surrounding_influences.screen_size = 320 x 240 



Fig. 5. Instance of a situation 



Such values for an actual situation can be queried from the system, calculated with 
GPS technologies (global positioning system), or gained with the help of some meta 
information. 

2.3 Modeling Situated Preferences 

Since situations can have great influence on the user’s preferences, we consider situ- 
ated preferences in this section. There are various frameworks for preference models 
(see [6, 14] for a discussion on preference models). Our following approach is kept 
independent from the underlying preference model. 

In the previous situation modeling examples we stated an identifier for situations. 
This sid can be used to refer to situations. For instance, sid = 212 corresponds to the 
situation described in Fig. 5. Preferences are specified with a pid. For example, a 






516 Stefan Holland and Werner Kießling 



preference like “I like notebooks manufactured by HP and Dell” may be identified 
with pid = 29. Before we can model situated preferences we have to consider the 
various possibilities of situated preferences since some user preferences may hold in 
general, whereas other preferences are only valid in one or more situations. We iden- 
tify three types of preferences: 

• Long-term preference: this preference holds generally. 

• Singular preference: this preference holds in exactly one situation. 

• Non-singular preference: this preference holds in more than one situation. 

We model such situated preferences as N:M relationships between situations and 
preferences. A concrete situated preference can be considered as a tuple ( sid , pid) 
expressing that the preference pid holds in the situation sid. By convention a long- 
term preference is identified with sid = 0. 

Example 1. Classification of situated preferences 

We consider the tuples {(0, 12), (0, 13), (127, 22), (128, 22), (212, 29), (212, 30)} of 
situated preferences. In this example we have two long-term preferences, namely the 
preferences 12 and 13. In the two situations with sid =127 and sid =128 the prefer- 
ence 22 holds. The preferences 29 and 30 are valid, if the situation 212 occurs. ♦ 
The preference repository introduced in Section 3 stores and manages such N:M 
relationships between situations and preferences. 
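
The classification in Example 1 can be reproduced mechanically from the (sid, pid) tuples. The following sketch is an illustration only; the function names are hypothetical, and the rule that long-term preferences are returned for every situation is an assumption derived from the statement that they hold generally.

```python
from collections import defaultdict

LONG_TERM_SID = 0  # by convention, sid = 0 marks a long-term preference


def classify(situated_prefs):
    """Map each pid to 'long-term', 'singular' or 'non-singular'."""
    sids_by_pid = defaultdict(set)
    for sid, pid in situated_prefs:
        sids_by_pid[pid].add(sid)
    result = {}
    for pid, sids in sids_by_pid.items():
        if LONG_TERM_SID in sids:
            result[pid] = "long-term"
        elif len(sids) == 1:
            result[pid] = "singular"
        else:
            result[pid] = "non-singular"
    return result


def preferences_for(situated_prefs, sid):
    """Return the pids valid in situation `sid`, including long-term ones."""
    return sorted({p for s, p in situated_prefs if s in (sid, LONG_TERM_SID)})


# The tuples of Example 1.
EXAMPLE = {(0, 12), (0, 13), (127, 22), (128, 22), (212, 29), (212, 30)}
```

For the tuples of Example 1, `classify(EXAMPLE)` marks 12 and 13 as long-term, 22 as non-singular, and 29 and 30 as singular, while `preferences_for(EXAMPLE, 212)` returns `[12, 13, 29, 30]`.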

3 The Preference Repository 

The preference repository defines a general storage structure to manage long-term and 
situated preferences for personalized database applications. We use the strict partial 
order approach for modeling preferences introduced in [14], where an intuitive, pow- 
erful, and flexible constructor-based framework for preferences is given. We define 
the preference repository using the Extensible Markup Language (XML). An XML-based 
schema extension for user preferences is already specified within the MPEG-7 stan- 
dard [13]. For two reasons this approach is not appropriate for a preference repository. 
Firstly, it is focused on MPEG-7 documents and therefore general preferences of 
other domains cannot be recorded in a natural way. Secondly, MPEG-7 uses scores to 
describe user wishes leading to preferences with very limited semantic expressiveness 
compared to the general strict partial order approach. 

3.1 Preference Constructors and the BMO Query Model 

In this section we revisit those concepts of the preference model of [14] that are rele- 
vant for the scope of this paper. A preference P is defined as a strict partial order 
P = (A, <_P), where A = {A_1, ..., A_k} denotes a set of attributes with corresponding 
domains dom(A_i). The domain of A is defined as the Cartesian product of the dom(A_i), 
<_P ⊆ dom(A) × dom(A), and x <_P y is interpreted as “y is better than x”. 

For an intuitive preference modeling a set of preference constructors is defined. 
These constructors include POS(A, pos-set) and POS/POS(A, pos1-set; pos2-set) on 
categorical domains and LOWEST(A) on numerical domains. The pos-set ⊆ dom(A) 
of a POS preference defines a set of favorite values that are better than all other val- 
ues of dom(A). A POS/POS preference distinguishes between optimal (pos1-set) and 
alternative values (pos2-set). For a LOWEST preference lower values are better. Pref- 
erences can inductively be combined with complex preference constructors. Relevant 
for this paper is the Pareto preference P = P_1 ⊗ P_2, which treats the underlying prefer- 
ences as equally important. Precise definitions can be found in [14]. 
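
The constructors can be read as functions that return a "better-than" test. The sketch below encodes each preference as a Python function `less(x, y)` that is true when y is better than x. It is an illustrative reading, not the formal definition: the function names are hypothetical, and the Pareto rule used here (y is better in one component and better or equal in the other) is the usual formulation for strict partial orders, for which [14] remains the authoritative source.

```python
def pos(pos_set):
    """POS: values in pos_set are better than all other values."""
    return lambda x, y: x not in pos_set and y in pos_set


def pos_pos(pos1_set, pos2_set):
    """POS/POS: optimal values beat alternatives, which beat the rest."""
    def level(v):
        return 0 if v in pos1_set else 1 if v in pos2_set else 2
    return lambda x, y: level(y) < level(x)


def lowest():
    """LOWEST: lower numerical values are better."""
    return lambda x, y: y < x


def pareto(p1, p2):
    """Pareto combination of two preferences over pairs (x1, x2).

    Assumption: y is better than x if it is better in one component and
    better or equal in the other.
    """
    def less(x, y):
        (x1, x2), (y1, y2) = x, y
        return ((p1(x1, y1) and (p2(x2, y2) or x2 == y2)) or
                (p2(x2, y2) and (p1(x1, y1) or x1 == y1)))
    return less


# Equal importance of "manufacturer should be HP" and "price should be low".
hp_and_cheap = pareto(pos({"HP"}), lowest())
```

Here `hp_and_cheap(("Dell", 900), ("HP", 900))` is true, whereas an HP notebook for 900 and a Dell notebook for 800 are incomparable: neither is better than the other, which is exactly what a strict partial order permits.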

Such preferences can be evaluated on SQL or XML databases using the query lan- 
guages Preference SQL or Preference XPATH, respectively [16]. These query en- 
gines return those database elements that best match the strict partial order speci- 
fied in the preference expression. This behavior is called the best-matches-only (BMO) 
query model [14]. It accomplishes a suitable match-making between wishes and real- 
ity. For instance, applying the preference P = POS/POS(manufacturer, {HP}; {Dell}), 
items produced by HP are delivered, if they exist in the database. Otherwise, if prod- 
ucts manufactured by Dell exist, they are returned. Otherwise, products from different 
vendors are delivered. Several real-life applications use this preference model for 
describing user wishes [8, 15]. 
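The BMO behavior described above amounts to keeping all items that are not dominated by another item. A minimal Python sketch, assuming a hypothetical product record with a "manufacturer" field (this is not the Preference SQL or Preference XPATH implementation):

```python
def bmo(items, better):
    # Keep every item x for which no other item y is better than x.
    return [x for x in items if not any(better(x, y) for y in items)]

def pos_pos_manufacturer(pos1_set, pos2_set):
    # POS/POS on the hypothetical "manufacturer" field of a product record.
    def level(item):
        m = item["manufacturer"]
        return 0 if m in pos1_set else 1 if m in pos2_set else 2
    return lambda x, y: level(y) < level(x)

better = pos_pos_manufacturer({"HP"}, {"Dell"})
catalog = [
    {"id": 1, "manufacturer": "Dell"},
    {"id": 2, "manufacturer": "Acer"},
    {"id": 3, "manufacturer": "Dell"},
]
print([p["id"] for p in bmo(catalog, better)])       # [1, 3]: no HP item, so Dell wins
print([p["id"] for p in bmo(catalog[1:2], better)])  # [2]: only another vendor is left
```

The result is never empty for a non-empty input, which is the property the later query examples rely on.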



3.2 Preference Repository Document Type Definition 

Using an XML-based preference repository has several advantages. By defining recursive
elements in the corresponding document type definition, combined preferences of
any complexity can be recorded. A clear specification prevents the storage of
preferences that do not conform to the preference algebra defined in [14]. Furthermore,
an XML-based preference repository can be accessed either via XPATH and Preference
XPATH [16] or from object-oriented programming languages like Java or C++ using
the document object model (DOM). Another aspect is the interchangeability of XML
documents. Preference repositories based on XML can be exchanged between various
personalized applications.

In this section we describe some interesting details of the preference repository.
The key tags of the document type definition are given in Fig. 6. The element
PrefRepository is the root element of a preference repository XML document. Preferences
are stored for each user separately. As proposed in [13], a user is denoted by a
UserId element, whose name attribute must be unique, as indicated by the XML type
ID. A UserId can also represent a user group or a whole domain, where the latter can
be used for storing domain preferences. For details on modeling user groups and
stereotypes see [23]. For each user several (situated) preferences can be recorded as
PrefData elements. The Preference elements consist of the various preference
constructors as sub-elements, whereby for each preference the relevant additional
information, like the pos-set of a POS preference, has to be recorded. Combined
preferences of any complexity can be stored with the recursive definition of the
corresponding XML elements. For the detailed description of all preference elements in
the preference repository see [11]. A PrefData element contains zero, one, or more
preferences and zero, one, or more situations, representing the N:M relationship between
preferences and situations introduced in Section 2.3. Situations are stored as
conditions with key-value pairs that assign concrete values to the attributes of the
situation model. The Source denotes the origin of a preference. The origin
could be, e.g., preference mining [11, 12] or preference query languages like
Preference SQL or Preference XPATH [16].

<!ELEMENT PrefRepository (UserId*)>
<!ELEMENT UserId (PrefData*)>
<!ATTLIST UserId name ID #REQUIRED>
<!ELEMENT PrefData (Preference*, Source?, Situation*)>
<!ATTLIST PrefData name ID #REQUIRED>
<!ELEMENT Situation (Condition*)>
<!ATTLIST Situation sid ID #REQUIRED>
<!ELEMENT Condition EMPTY>
<!ATTLIST Condition key CDATA #REQUIRED>
<!ATTLIST Condition value CDATA #REQUIRED>

Fig. 6. Main elements of the preference repository

We consider the COSIMA application for an example of a preference repository
XML document (Fig. 7). In her role as purchaser, Laverne has a LOWEST(price)
preference if she is bargaining for a notebook and is feeling satisfied. If she has the
same emotion and the same role but is bargaining for a desktop PC, she has a
POS(installed_software, {Windows XP, Acrobat Reader}) preference.
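To illustrate programmatic access to such a document, the following Python sketch reads the repository of Fig. 7 with the standard library's ElementTree API. The function name load_pref_data and the record layout are assumptions made for this illustration and are not part of the repository interface of [11].

```python
import xml.etree.ElementTree as ET

# Copy of the Fig. 7 repository (attribute values quoted as XML requires).
REPO_XML = """
<PrefRepository><UserId name="Laverne">
  <PrefData name="Laverne1">
    <Preference pid="31"><LOWEST att="price"/></Preference>
    <Situation sid="127">
      <Condition key="product" value="notebook"/>
      <Condition key="role" value="purchaser"/>
      <Condition key="emotion" value="satisfied"/>
    </Situation>
  </PrefData>
  <PrefData name="Laverne2">
    <Preference pid="35"><POS att="installed_software"><POSSet>
      <Value val="Windows XP"/><Value val="Acrobat Reader"/>
    </POSSet></POS></Preference>
    <Situation sid="128">
      <Condition key="product" value="desktop"/>
      <Condition key="role" value="purchaser"/>
      <Condition key="emotion" value="satisfied"/>
    </Situation>
  </PrefData>
</UserId></PrefRepository>
"""

def load_pref_data(xml_text, user):
    # Return one record per PrefData element of the given user.
    root = ET.fromstring(xml_text)
    records = []
    for pd in root.findall(f"UserId[@name='{user}']/PrefData"):
        records.append({
            "name": pd.get("name"),
            "pids": [p.get("pid") for p in pd.findall("Preference")],
            "constructors": [p[0].tag for p in pd.findall("Preference")],
            "situations": [
                {c.get("key"): c.get("value") for c in s.findall("Condition")}
                for s in pd.findall("Situation")
            ],
        })
    return records

for rec in load_pref_data(REPO_XML, "Laverne"):
    print(rec["name"], rec["constructors"], rec["situations"])
```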

3.3 The Preference Repository Query Interface 

There are two ways of querying the preference repository. On the one hand, we
define a set of useful operations like “get all preferences of a specific customer”;
on the other hand, we provide a Preference XPATH interface [11]. With it, the
application developer can query the preference repository using the full functionality
of XPATH and Preference XPATH.

<PrefRepository> <UserId name="Laverne">
  <PrefData name="Laverne1">
    <Preference pid="31"> <LOWEST att="price"/> </Preference>
    <Situation sid="127">
      <Condition key="product" value="notebook"/>
      <Condition key="role" value="purchaser"/>
      <Condition key="emotion" value="satisfied"/>
    </Situation> </PrefData>
  <PrefData name="Laverne2">
    <Preference pid="35"> <POS att="installed_software">
      <POSSet> <Value val="Windows XP"/>
               <Value val="Acrobat Reader"/>
      </POSSet> </POS> </Preference>
    <Situation sid="128">
      <Condition key="product" value="desktop"/>
      <Condition key="role" value="purchaser"/>
      <Condition key="emotion" value="satisfied"/>
    </Situation> </PrefData>
</UserId> </PrefRepository>

Fig. 7. Preference repository example

In personalized applications it is important to know which user preferences are
valid w.r.t. a specific situation, so that the application can react appropriately. For
finding singular and non-singular situated preferences we postulate the following
requirements:

1. It should be possible to get preferences that belong to a given situation.

2. It should be possible to get preferences whose situations best match a given
situation.

The former requirement can be fulfilled with appropriate XPATH requests on the
preference repository. These requests are exact-match queries with the structure

Situation/Condition[@key=attribute_name and
                    @value=attribute_value]

where the situation, described as a condition, can be specified by using the full
XPATH functionality. In the latter case we use Preference XPATH, since this query
language implements the BMO query model (see Section 3.1). The given situation is
treated as a preference expression, and Preference XPATH computes those PrefData
elements with best-matching situations. Our access operations already include a
Preference XPATH interface, so such queries can be executed without any additional
effort.



3.4 Query Examples 

Below we give some examples of XPATH and Preference XPATH queries. They are
based on the preference repository in Fig. 7.

Example 2. Querying long-term preferences

First, we want to get Laverne’s long-term preferences stored in the preference
repository.

UserId[@name='Laverne']/PrefData[Situation[@sid=0]]

Long-term preferences are identified by sid = 0 (see Section 2.3). The corresponding
preferences can be queried with XPATH using a hard condition ([@sid=0]). ♦

Example 3. Querying situated preferences using hard conditions

Assume Laverne is bargaining for a notebook. We know from mimic recognition [21]
that she is feeling satisfied. Furthermore, her current role is chief purchaser. We are
interested in those preferences that are relevant in this situation.

UserId[@name='Laverne']/PrefData
  [Situation/Condition[@key='product' and @value='notebook']]
  [Situation/Condition[@key='emotion' and @value='satisfied']]
  [Situation/Condition[@key='role' and @value='chief_purchaser']]

In this XPATH query the situation is specified as a hard condition. Applying this
query to our repository example (Fig. 7) delivers an empty result, since there is no
exact match. ♦
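The exact-match behavior of this example can be reproduced with a few lines of Python. The sketch below is only an approximation of the XPATH query: it checks the key-value pairs by hand instead of evaluating the path expression, and exact_match is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Situations of Fig. 7 only; preference bodies are omitted for brevity.
REPO_XML = """
<PrefRepository><UserId name="Laverne">
  <PrefData name="Laverne1"><Situation sid="127">
    <Condition key="product" value="notebook"/>
    <Condition key="role" value="purchaser"/>
    <Condition key="emotion" value="satisfied"/>
  </Situation></PrefData>
  <PrefData name="Laverne2"><Situation sid="128">
    <Condition key="product" value="desktop"/>
    <Condition key="role" value="purchaser"/>
    <Condition key="emotion" value="satisfied"/>
  </Situation></PrefData>
</UserId></PrefRepository>
"""

def exact_match(xml_text, user, situation):
    # Return names of PrefData elements with a Condition for every (key, value) pair.
    root = ET.fromstring(xml_text)
    hits = []
    for pd in root.findall(f"UserId[@name='{user}']/PrefData"):
        conds = {(c.get("key"), c.get("value"))
                 for c in pd.iterfind("Situation/Condition")}
        if all(pair in conds for pair in situation.items()):
            hits.append(pd.get("name"))
    return hits

wanted = {"product": "notebook", "emotion": "satisfied", "role": "chief_purchaser"}
print(exact_match(REPO_XML, "Laverne", wanted))  # []: no stored role is chief_purchaser
```

Dropping the role condition from the dictionary would return Laverne1, which shows that a single unmet hard condition is enough to empty the result.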

Example 4. Querying situated preferences using soft conditions

The embarrassing empty-result effect occurring in the previous example is caused by
the exact-match query model of XPATH. Using Preference XPATH with BMO query
semantics we can avoid this effect. If perfect matches do not exist, preferences with
best-matching situations are returned.

For each attribute of the situation, the attribute name is specified with a hard
condition that must be fulfilled (e.g. @key='product'). The value is treated as a soft
condition that should be fulfilled (e.g. POS(value, {'notebook'})). The components
of the query are assembled with Pareto accumulation (‘⊗’ operator). Soft
conditions are denoted with ‘#[’ and ‘]#’ in Preference XPATH syntax.

UserId[@name='Laverne']/PrefData
  #[Situation/Condition[@key='product']#[POS(value, {'notebook'})]#
  ⊗ Situation/Condition[@key='emotion']#[POS(value, {'satisfied'})]#
  ⊗ Situation/Condition[@key='role']#[POS(value, {'chief_purchaser'})]#]#

If available, Preference XPATH delivers perfect matches. Otherwise, best alternatives
are returned. Using our repository example (Fig. 7), the preference with pid = 31
is returned, since the situation sid = 127 is the best match for the given situation. ♦
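The effect of the soft conditions can be sketched in Python as a Pareto comparison of the stored situations. This is an illustration under assumptions, not the Preference XPATH engine: the n-ary Pareto rule used here (better or equal on every key, strictly better on at least one) generalizes the binary operator, and the dictionary STORED simply restates the situations of Fig. 7.

```python
# Stored situations of Fig. 7, keyed by the pid of the attached preference.
STORED = {
    31: {"product": "notebook", "role": "purchaser", "emotion": "satisfied"},
    35: {"product": "desktop", "role": "purchaser", "emotion": "satisfied"},
}

def pareto_better(x, y, wanted):
    # y is better than x if it is better or equal on every key and strictly
    # better on at least one, using POS(value, {wanted[key]}) per key.
    strictly = False
    for key, target in wanted.items():
        xv, yv = x.get(key), y.get(key)
        if xv != target and yv == target:
            strictly = True          # POS: y hits the favorite value, x does not
        elif xv != yv:
            return False             # y is worse or incomparable on this key
    return strictly

def best_matches(stored, wanted):
    # BMO: keep the pids whose situation is not dominated by any other one.
    return [pid for pid, s in stored.items()
            if not any(pareto_better(s, t, wanted) for t in stored.values())]

wanted = {"product": "notebook", "emotion": "satisfied", "role": "chief_purchaser"}
print(best_matches(STORED, wanted))  # [31]
```

Situation 127 wins because it matches on product and ties with situation 128 on the other two keys, even though neither situation has the requested role.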

Such queries for situated preferences can additionally be improved by using
semantic knowledge of the underlying domain stored in an appropriate ontology [7]. We
discuss this in the following example.

Example 5. Querying situated preferences using ontologies

Assume a simple ontology for computer hardware products is given (Fig. 8). We
know that Laverne is currently bargaining for a pda, and therefore we are interested in
relevant situated preferences. Our repository example in Fig. 7 contains only situated
preferences for notebooks and desktops. Using an ontology-based “broaden” mechanism
within the Preference XPATH query, related situated preferences can be delivered.

UserId[@name='Laverne']/PrefData[Situation/Condition
  [@key='product']#[POS/POS(value, {pda}; broaden(pda))]#]

<Subject name="computing systems">
  <Topic name="pda"> </Topic>
  <Topic name="workstation"> </Topic>
  <Topic name="notebook"> </Topic>
  <Topic name="tablet pc"> </Topic>
  <Topic name="desktop"> </Topic>
</Subject>

Fig. 8. Ontology example

The function broaden(term) computes the set of ontology elements that have
the same parent node as term. The Preference XPATH query above now delivers the
situated preferences that hold if Laverne is bargaining for a pda. If none exists,
situated preferences for the products workstation, notebook, tablet pc, and desktop are
delivered. ♦
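A possible reading of broaden is sketched below in Python, using the ontology of the example as input. The implementation is an assumption made for illustration; in particular, excluding the term itself from the result follows the list of products given in the text.

```python
import xml.etree.ElementTree as ET

ONTOLOGY_XML = """
<Subject name="computing systems">
  <Topic name="pda"/><Topic name="workstation"/><Topic name="notebook"/>
  <Topic name="tablet pc"/><Topic name="desktop"/>
</Subject>
"""

def broaden(term, ontology_xml=ONTOLOGY_XML):
    # Return the names of all siblings of `term`, i.e. nodes sharing its parent.
    root = ET.fromstring(ontology_xml)
    for parent in root.iter():
        names = [child.get("name") for child in parent]
        if term in names:
            return {n for n in names if n != term}
    return set()  # unknown term: nothing to broaden to

print(sorted(broaden("pda")))
# ['desktop', 'notebook', 'tablet pc', 'workstation']
```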
3.5 Updating the Preference Repository

User preferences and situations can change over time. Assume Laverne states that
there is no situation in which she has a preference for installed software. Furthermore,
she informs us that her LOWEST(price) preference holds generally. Finally, the
Preference Miner detects a situated POS(manufacturer, {Dell}) preference that holds
if her role is chief purchaser.

Such changes of the user’s preferences can be obtained from preference query
languages [16], user feedback, or preference mining [11, 12]. Details about the varying
situations can be queried from the system (timestamp), calculated with GPS
technologies (location), or transmitted with an appropriate data transfer protocol (client
device). The emotion can be detected by applying mimic recognition [21].

Updates of situations and preferences can easily be managed by using the preference
repository interface. The predefined methods for inserts, updates, and deletions
have several benefits for application developers. Developers who use these access
operations do not have to care about the detailed structure of the preference repository.
Preferences of any complexity can be stored in the preference repository, and it is
guaranteed that they are inserted and updated correctly. Query interfaces typically
produce a large number of preferences. Using the predefined access operations, they
can be inserted and managed in the preference repository in a comfortable way.
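The three changes from the scenario above can be sketched as follows. The helper names, the new PrefData name Laverne3, and the identifiers 36 and 129 are hypothetical; the actual access operations of the repository interface are described in [11] and may look different.

```python
import xml.etree.ElementTree as ET

def delete_preferences_on(user_el, attribute):
    # Drop every PrefData whose preference constructor refers to `attribute`.
    for pd in list(user_el.findall("PrefData")):
        if pd.find(f"Preference/*[@att='{attribute}']") is not None:
            user_el.remove(pd)

def make_long_term(pref_data):
    # Replace all situations by the long-term marker sid = 0 (see Section 2.3).
    for s in list(pref_data.findall("Situation")):
        pref_data.remove(s)
    ET.SubElement(pref_data, "Situation", sid="0")

def insert_pos(user_el, name, pid, attribute, values, situation, sid):
    # Add a situated POS(attribute, values) preference with its conditions.
    pd = ET.SubElement(user_el, "PrefData", name=name)
    pos = ET.SubElement(ET.SubElement(pd, "Preference", pid=pid), "POS", att=attribute)
    pos_set = ET.SubElement(pos, "POSSet")
    for v in values:
        ET.SubElement(pos_set, "Value", val=v)
    sit = ET.SubElement(pd, "Situation", sid=sid)
    for key, value in situation.items():
        ET.SubElement(sit, "Condition", key=key, value=value)
    return pd

# Shortened version of the Fig. 7 repository (conditions omitted).
repo = ET.fromstring(
    '<PrefRepository><UserId name="Laverne">'
    '<PrefData name="Laverne1"><Preference pid="31"><LOWEST att="price"/></Preference>'
    '<Situation sid="127"/></PrefData>'
    '<PrefData name="Laverne2"><Preference pid="35"><POS att="installed_software"/>'
    '</Preference><Situation sid="128"/></PrefData>'
    '</UserId></PrefRepository>')
laverne = repo.find("UserId[@name='Laverne']")
delete_preferences_on(laverne, "installed_software")        # no software preference
make_long_term(laverne.find("PrefData[@name='Laverne1']"))  # LOWEST(price) holds generally
insert_pos(laverne, "Laverne3", "36", "manufacturer", ["Dell"],
           {"role": "chief_purchaser"}, "129")              # mined situated preference
print([pd.get("name") for pd in laverne.findall("PrefData")])  # ['Laverne1', 'Laverne3']
```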



4 Summary and Outlook 

In this paper we presented a novel framework for modeling situated preferences and
preference repositories. We defined an extensible meta model for situations so that
well-designed situation models can be created and integrated into existing ER models.
Modifications of situated ER models are uncomplicated, and therefore changes in the
situation model can also be handled. This supports a straightforward software
development process for situated and personalized database applications.

The preference repository plays a central role in personalized applications. It
allows the storage and management of the users’ long-term and situated preferences.
The stored preferences can be applied for personalized query composition,
user-centric product recommendations, or personalized result presentation. With the
Preference XPATH interface, preferences that best match a given situation can be
queried. This approach allows e-applications to react flexibly and in a personalized
way to the changing situations of their customers. The preference repository is already
in practical use in the intelligent e-procurement prototype COSIMA B2B [15], which is
part of the interdisciplinary Bavarian research cooperation FORSIP on “Situated,
Individualized, and Personalized Human-Computer Interaction” (http://www.forsip.de).

This paper has introduced innovations on situated preferences and preference
repositories that suggest several promising directions for future research. One direction
is dealing with dynamic situations. In personalized applications a user’s situation
may change during a session. For example, his role or his emotional state may change
during a shopping tour. The integration of such dynamic situations into our situated
meta model forms an interesting research task. The consideration of a user’s
interaction history [10] may also enrich our situation models noticeably. Another direction
deals with the detection of situated preferences. A promising approach is the
adaptation of preference mining algorithms. Preference mining works on the user’s log data
containing information about requested articles or bought products. In order to detect
situated preferences, the log has to be extended with situation-specific data, so that for
each transaction details about the current situation (timestamp, location, emotion,
client device, etc.) are recorded. This allows the invocation of the preference mining
algorithms for each situation separately, so that situated preferences can be
detected. Another interesting task is the application of our techniques for situation
modeling and preference repositories in new domains. For instance, the fast-growing area
of personalized mobile services (e.g. online banking, route planning, or electronic
shopping) requires comprehensive knowledge about the situated preferences of the
mobile users [27]. For example, a user may have different preferences on text length
or image resolution depending on whether he is using his notebook or his pda. Therefore
our techniques will find a broad application area in personalized mobile environments.
Abstract. In the area of Web services and service-oriented architectures,
business protocols are rapidly gaining importance and mindshare
as a necessary part of Web service descriptions. Their immediate benefit
is that they provide developers with information on how to write clients
that can correctly interact with a given service or with a set of services. In
addition, once protocols become an accepted practice and service descriptions
become endowed with protocol information, the middleware can be
significantly extended to better support service development, binding,
and execution in a number of ways, considerably simplifying the whole
service life-cycle. This paper discusses the different ways in which the
middleware can leverage protocol descriptions, and focuses in particular
on the notions of protocol compatibility, equivalence, and replaceability.
They characterise whether two services can interact based on their protocol
definitions, whether a service can replace another in general or when
interacting with specific clients, and what the set of possible interactions
between two services is.



1 Introduction 

Web services, and more generally service-oriented architectures (SOAs), are
emerging as the technologies and architectures of choice for implementing
distributed systems and performing application integration within and across
companies’ boundaries. The basic principles of SOAs consist in modularizing
functions and exposing them as services that are typically specified using (de jure
or de facto) standard languages and interoperate through standard protocols.
Web service technology is characterized by two trends that were not part of
conventional (e.g., CORBA-like) middleware services and that are relevant to the
topics discussed in this paper. The first is that, from a technology perspective,
all interacting entities are considered to be (Web) services, even when they are
in fact requesting and not providing services. This allows uniformity in the
specification language (for example, the interfaces of both requestors and providers
will be described using the Web Services Description Language, WSDL) and
uniformity in the development and runtime support tools.

The second trend, which is gathering momentum and mindshare, is that of
including, as part of the service description, not only the service interface but
also the business protocol supported by the service, i.e., the specification of which
message exchange sequences (called conversations in the following) are supported
by the service [3]. This is important, as it rarely happens that service operations
can be invoked at will independently from one another. The interactions between
clients and services are always structured in terms of a set of operation
invocations, whose order typically has to obey certain constraints for clients to be
able to obtain the service they need. In the following, we use the term external
specification to refer to the combination of the interface and business protocol
specifications that define the externally visible behavior of a service [1, 12]. In
addition to the business protocol¹, a service may be characterized by other
protocols, such as security (e.g., trust negotiation) or transaction protocols, that also
need to be exposed as part of the service description so that clients know how
to interact with a service [3, 10].

If two or more services need to interoperate, their protocols must be compatible.
For example, a bookseller’s business protocol may require customers’
Web services to first invoke the orderBook operation and then the makePayment
operation. If a requestor wishes to interact with this service, then its business
protocol will need to include the invocation of the orderBook operation followed
at some point by the invocation of the makePayment operation. If this is not
the case, then the interaction between the two entities will result in an error.
Hence, it is essential that requestors are only bound, statically or dynamically,
to providers that have compatible protocols.

This paper analyzes protocol compatibility and similarity in Web services. In
particular, we define and characterize different types of protocol compatibility,
corresponding to different capabilities of services to interoperate, and we show
how, given two services and their external specifications, it is possible to
formally identify their compatibility level. In addition, we discuss similarities and
differences between protocols, to understand whether two services exhibit the same
behavior or whether one can be used instead of another when serving a certain client.
In doing the analysis, our motivation and goal is to devise protocol management
primitives that support and simplify service development. This complements our
earlier efforts aimed at designing and developing a complete CASE tool
supporting the Web service lifecycle [2, 3, 10]. Indeed, as discussed in this paper,
the primitives presented here can be used by service development and runtime
environments to: i) assist developers in creating and evolving Web services that
are compatible with other services of interest or with standard protocol
specifications; ii) identify (statically or dynamically) services that can interoperate
with a given service; iii) manage non-compatibility situations.

This paper does not discuss other aspects that are in general relevant to
identifying whether two services can interact to achieve the desired goals. For
example, we do not deal with quality of service issues, or with structural and
semantic interoperability of messages [6]. While we believe that these issues are
also important, the (syntactic) protocol compatibility and similarity analysis
discussed here is complex enough in itself to deserve a whole paper (and indeed
many aspects still remain to be addressed). Finally, we observe that, although
we introduce the concepts based on a specific protocol language to make the
presentation more concrete, the results presented here can be applied to any
protocol language, such as WSCI or BPEL.

¹ In this paper we will use “business protocol” and “protocol” interchangeably.

The outline of the paper is as follows: Section 2 introduces a protocol model 
and some notations and concepts used throughout the paper. Section 3 defines 
a collection of protocol management operators that allow understanding com- 
monalities and differences between protocols, as well as whether two protocols 
can interact with each other. Section 4 introduces compatibility and similar- 
ity classes, and shows how the model and operators developed in the previous 
sections can be used to analyze and understand the kind of compatibility or sim- 
ilarity that two protocols exhibit. Finally, Section 5 concludes the paper with a 
discussion of possible applications of the proposed protocol analysis. 

2 Preliminaries 

2.1 Business Protocols Modeling 

Following our previous work [3], we choose to model a service business protocol
(protocol for short) as a non-deterministic finite state machine, where the
states represent the different phases that a service may go through during its
interaction with a requestor. Transitions are triggered by messages sent by the
requestor to the provider or vice versa (hence, transitions are labeled with either
input or output messages). A message corresponds to the invocation of a service
operation or to its reply. Note that each service may be simultaneously involved
in several message exchanges (conversations) with different clients, and therefore
can be characterized by multiple concurrent instantiations of the protocol
state machine. The purpose of the protocol is essentially to specify the set of
conversations that are supported by the service. The reason for using a state
machine-based model is that it is a formalism that is fairly easy to understand
for users, it is suitable for describing reactive behaviors, and it has the notion of
state, which is useful for monitoring service executions. Furthermore, there are
a number of models and tools (some developed by the authors [2]) that enable
protocol modelling by means of state machines. The need for non-determinism
comes from the observation that a service may respond in different ways to a
certain message, based on internal business logic that is not exposed as part
of the protocol. For example, in response to an “approval request” message, a
service may move to different states based on whether the request is approved or
rejected. However, the criterion by which the service moves to this or that state
is hidden from the user, as it is internal business logic that the provider does not
want to expose as part of the protocol definition.

As an example, Figure 1(a) shows a graphical representation of a protocol,
called P_1, that describes the external behavior of a store service. Each
transition is labeled with a message name followed by the message polarity², that
is, whether the message is incoming (plus sign) or outgoing (minus sign). For
instance, it specifies that the store service is initially in the Start state, and that
clients begin using the service by sending a login message, upon which the
service moves to the Logged state (transition login(+)). We next provide a formal
definition of a protocol.

² The notion of message polarity is borrowed from [13].

Definition 1. (Business protocol)

A business protocol is a tuple P = (S, s_0, F, M, R) which consists of the following
elements:

– S is a finite set of states.

– s_0 ∈ S is the initial state.

– F ⊆ S is a set of final states. If F = ∅, then P is said to be an empty
protocol.

– M is a finite set of messages. For each message m ∈ M, we define a function
Polarity(P, m) which will be positive (+) if m is an input message in P
and negative (−) if m is an output message in P. In the sequel, we use the
notation m(+) (respectively, m(−)) to denote the polarity of a message m.

– R ⊆ S² × M is a finite set of transitions. Each transition (s, s', m) identifies a
source state s, a target state s' and either an input or an output message m
that is either consumed or produced during this transition. In the sequel, we
write R(s, s', m) instead of (s, s', m) ∈ R.
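The tuple of Definition 1 maps directly onto a small data structure. The Python sketch below is one possible encoding, with field names chosen for this illustration; the example instance contains only those states and transitions of the store protocol that are named in the text, not the complete Figure 1(a).

```python
from dataclasses import dataclass

@dataclass
class Protocol:
    states: frozenset        # finite set of states
    initial: str             # initial state
    finals: frozenset        # set of final states
    polarity: dict           # message name -> '+' (input) or '-' (output)
    transitions: frozenset   # triples (source, target, message)

    def __post_init__(self):
        # Enforce the typing constraints of the definition.
        assert self.initial in self.states
        assert self.finals <= self.states
        assert all(p in ("+", "-") for p in self.polarity.values())
        for s, t, m in self.transitions:
            assert s in self.states and t in self.states and m in self.polarity

    def is_empty(self):
        # A protocol without final states is called empty.
        return not self.finals

# Fragment of the store protocol mentioned in the text (assumed to be incomplete).
store = Protocol(
    states=frozenset({"Start", "Logged", "Selecting", "Cancelled"}),
    initial="Start",
    finals=frozenset({"Cancelled"}),
    polarity={"login": "+", "selectGoods": "+", "cancel": "+"},
    transitions=frozenset({
        ("Start", "Logged", "login"),
        ("Logged", "Selecting", "selectGoods"),
        ("Selecting", "Selecting", "selectGoods"),
        ("Selecting", "Cancelled", "cancel"),
    }),
)
print(store.is_empty())  # False
```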




(a) The conversation protocol P_1 of a store service (b) The conversation protocol P_2 of a client



Fig. 1. Business protocols. 



2.2 Execution Paths and Execution Trees 

In this subsection, we introduce some important concepts and definitions that
are used to define the semantics of the protocol model defined above³. A protocol
defines all the possible conversations that a service supports in terms of
alternating sequences of states and messages. We call these sequences execution paths.
For example, the sequence Start.login(+).Logged.selectGoods(+).Selecting is an
execution path of protocol P_1. We are particularly interested in the complete
execution paths (i.e., paths that start from the initial state and end at a final state),
as they denote the set of correct conversations supported by a service. For example,
the execution path Start.login(+).Logged.selectGoods(+).Selecting.cancel(+).Cancelled
is a complete execution path of protocol P_1. The sequence
of message exchanges login(+).selectGoods(+).selectGoods(+).cancel(+),
extracted from the complete execution path depicted in Figure 3(d), represents
a conversation which is compliant with (i.e., is allowed by) protocol P_1 of
Figure 1(a).

³ See [8] for details on the various process model semantics.

Fig. 2. Comparing protocols with respect to their branching structures.
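Whether a conversation is compliant with a protocol can be checked by following all execution paths at once. The Python sketch below is an illustration, not an algorithm from the paper; the transition set lists only the part of the store protocol that the text mentions.

```python
def is_compliant(transitions, initial, finals, conversation):
    # Track every state the non-deterministic machine may be in after each message.
    current = {initial}
    for message in conversation:
        current = {t for (s, t, m) in transitions if s in current and m == message}
        if not current:
            return False           # no execution path continues with this message
    return bool(current & finals)  # some path must end in a final state

# Fragment of the store protocol named in the text; Figure 1(a) has more transitions.
R = {
    ("Start", "Logged", "login(+)"),
    ("Logged", "Selecting", "selectGoods(+)"),
    ("Selecting", "Selecting", "selectGoods(+)"),
    ("Selecting", "Cancelled", "cancel(+)"),
}
conv = ["login(+)", "selectGoods(+)", "selectGoods(+)", "cancel(+)"]
print(is_compliant(R, "Start", {"Cancelled"}, conv))      # True
print(is_compliant(R, "Start", {"Cancelled"}, conv[:2]))  # False: path is not complete
```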

Since protocols are represented using non-deterministic state machines,
execution paths are not enough to capture the branching structures of protocols.
As an example, Figures 2(a) and (b) show two protocols P and P' that specify
exactly the same set of compliant conversations (the conversations m1(+).m2(+)
and m1(+).m3(+)). However, we can observe that after sending a message m1
in protocol P, a client interacting with P will have a choice to either send the
message m2 or m3, while a client interacting with protocol P' will not have such
a choice. For example, the client protocol P_c depicted in Figure 2(c) can interact
correctly with the protocol P. However, the interaction of P_c with protocol P'
may result in an error (e.g., if P_c sends the messages m1 and then m2, while
protocol P' decides to move to the state s3 after receiving the message m1).

To compare protocols with respect to their branching structures, we adopt
the well-known branching-time approach [8] to describe business protocol
semantics. In this approach, the possible conversations allowed by a protocol are
characterized in terms of trees, called execution trees, instead of paths. The
execution trees of a protocol are used to derive what we call conversation trees. In
a nutshell, conversation trees of a protocol P capture all the conversations that
are compliant with P (i.e., message exchanges that occur in accordance with the
constraints imposed by P) as well as the branching structures of P (i.e., which
messages are allowed at each stage of a conversation).

To formally define the notions of execution and conversation trees, we use
the following definition of a tree as in [9]: A tree is a set τ ⊆ ℕ* such that if
xn ∈ τ, for x ∈ ℕ* and n ∈ ℕ, then x ∈ τ and xm ∈ τ for all 0 ≤ m < n. The
elements of τ represent nodes: the empty word ε is the root of τ, and for each
node x, the nodes of the form xn, for n ∈ ℕ, are children of x. Given a pair of sets
S and M, an (S, M)-labeled tree is a triple (τ, λ, δ), where τ is a tree, λ : τ → S
is a node labeling function that maps each node of τ to an element in S, and
δ : τ × τ → M is an edge labeling function that maps each edge (x, xn) of τ to
an element in M. Then, every path ρ = ε, n_0, n_0 n_1, ... of τ generates a sequence
Γ(ρ) = λ(ε).δ(ε, n_0).λ(n_0).δ(n_0, n_0 n_1).λ(n_0 n_1). ... of alternating labels from S
and M.

Informally, if S and M correspond to the sets of states and messages, we
can use an (S, M)-labeled tree to characterize protocol semantics. In particular,
the branches of the tree (once mapped with the labeling functions) represent
execution paths, and the tree hierarchy reflects the branching structures of the
protocol.

Definition 2. (Execution trees and conversation trees)

Let P = (S, s_0, F, M, R) be a business protocol.

(a) An execution tree of P is an (S, M)-labeled tree T = (τ, λ, δ) such that:

– λ(ε) = s_0, and

– for each edge (x, xn) of τ, we have R(λ(x), λ(xn), δ(x, xn)).

An execution tree T = (τ, λ, δ) is a complete execution tree of the protocol P
if for every leaf x ∈ τ we have λ(x) ∈ F.

(b) If T = (τ, λ, δ) is a complete execution tree of a protocol P, then T^c =
(τ, λ^c, δ), where λ^c(x) = ∅ for all x ∈ τ, is a conversation tree which is compliant
with protocol P.

For example, Figures 3(a) and (c) show complete execution trees of the
protocols P and P' of Figure 2. Figure 3(d) shows two complete execution trees
which are compliant with the protocol P_1 of Figure 1. Figure 3(b) shows a
conversation tree which is compliant with the protocol P (shown in Figure 2(a)).
This conversation tree describes the message exchanges that are accepted by P
(i.e., m1(+).m2(+) and m1(+).m3(+)) as well as the branching choice allowed
by P after receiving the message m1. Conversation trees of a protocol are derived
from complete execution trees by removing labels corresponding to the states.
For instance, the conversation tree of Figure 3(b) is derived from the complete
execution tree of Figure 3(a) by removing the labels of the states s1, s2, and s3.
In this paper we use complete execution trees to represent conversations that
are compliant with a protocol.
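The derivation of conversation trees can be illustrated with nested tuples in Python. The two trees below are assumed shapes for the protocols of Figure 2 (the figures themselves are not reproduced here, and the state names are illustrative); the sketch shows that both yield the same conversations but different conversation trees.

```python
def conversation_tree(execution_tree):
    # An execution tree is (state, [(message, subtree), ...]); drop the state labels.
    _state, children = execution_tree
    return tuple(sorted((msg, conversation_tree(sub)) for msg, sub in children))

def conversations(conv_tree):
    # Enumerate the root-to-leaf message sequences of a conversation tree.
    if not conv_tree:
        return {()}
    return {(msg,) + rest for msg, sub in conv_tree for rest in conversations(sub)}

# Assumed tree shapes: one branching point after m1 versus an early commitment at m1.
tree_P = ("s0", [("m1(+)", ("s1", [("m2(+)", ("s2", [])), ("m3(+)", ("s3", []))]))])
tree_P_prime = ("s0", [("m1(+)", ("s1", [("m2(+)", ("s2", []))])),
                       ("m1(+)", ("s3", [("m3(+)", ("s4", []))]))])

c, c_prime = conversation_tree(tree_P), conversation_tree(tree_P_prime)
print(conversations(c) == conversations(c_prime))  # True: same conversations
print(c == c_prime)                                # False: different branching
```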

2.3 Protocol Simulation 

The notion of simulation is used in the literature as a relation to compare labeled 
transition systems with respect to their branching structures [8, 9]. Simulation is 
a preorder relation on labeled transition systems that identifies whether a given 
system has the same branching structures as another one. Here, we introduce 
a slightly adapted notion of simulation between protocols that will be used to 
compare protocols with respect to their complete execution trees. 

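Definition 3. (Protocol Simulation)

Let $\mathcal{P} = (\mathcal{S}, s_0, \mathcal{F}, \mathbf{M}, \mathcal{R})$ and $\mathcal{P}' = (\mathcal{S}', s'_0, \mathcal{F}', \mathbf{M}', \mathcal{R}')$ be two protocols.

- A relation $\Gamma \subseteq \mathcal{S} \times \mathcal{S}'$ is a protocol simulation between protocols $\mathcal{P}$ and $\mathcal{P}'$
if whenever $(s_1, s'_1) \in \Gamma$ then the following holds:

• $\forall \mathcal{R}(s_1, s_2, m)$ there is an $s'_2$ such that $\mathcal{R}'(s'_1, s'_2, m)$, $\mathit{Polarity}(\mathcal{P}, m) = \mathit{Polarity}(\mathcal{P}', m)$ and $(s_2, s'_2) \in \Gamma$.

• $\forall (s, s') \in \Gamma$, if $s \in \mathcal{F}$ then $s' \in \mathcal{F}'$.

- We use notation $s_1 \preceq s'_1$ to say that there is a protocol simulation $\Gamma$ such
that $(s_1, s'_1) \in \Gamma$.

- We say that protocol $\mathcal{P}$ is simulated by $\mathcal{P}'$ (noted $\mathcal{P} \preceq \mathcal{P}'$) iff $s_0 \preceq s'_0$.

- We say that two protocols $\mathcal{P}$ and $\mathcal{P}'$ are similar (noted $\mathcal{P} \cong \mathcal{P}'$) iff $\mathcal{P} \preceq \mathcal{P}'$
and $\mathcal{P}' \preceq \mathcal{P}$.

Fig. 3. Example of execution trees of the protocol $\mathcal{P}_1$.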
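The simulation relation of Definition 3 can be computed as a greatest fixpoint: start from every pair of states that respects the final-state condition and discard pairs that violate the transition condition until nothing changes. The following is a minimal sketch under stated assumptions, not the authors' algorithm (which is given in [4]). The dictionary layout (`states`, `init`, `final`, `trans`, `pol`) and the two toy protocols are illustrative choices made here, loosely modelled on the m1/m2/m3 example.

```python
from itertools import product

def simulated_by(p, q):
    """Return True if protocol p is simulated by protocol q."""
    # Candidate pairs must already respect the final-state condition:
    # a final state of p may only be paired with a final state of q.
    rel = {(s, t) for s, t in product(p["states"], q["states"])
           if s not in p["final"] or t in q["final"]}
    changed = True
    while changed:
        changed = False
        for s, t in list(rel):
            for s1, m, s2 in p["trans"]:
                if s1 != s:
                    continue
                # q must answer with the same message and polarity,
                # reaching a state that is still related to s2.
                if not any(t1 == t and m2 == m
                           and q["pol"].get(m) == p["pol"][m]
                           and (s2, t2) in rel
                           for t1, m2, t2 in q["trans"]):
                    rel.discard((s, t))
                    changed = True
                    break
    return (p["init"], q["init"]) in rel

# Hypothetical toy protocols: `large` offers one more branch than `small`.
small = {"states": {"s0", "s1", "s2"}, "init": "s0", "final": {"s2"},
         "trans": {("s0", "m1", "s1"), ("s1", "m2", "s2")},
         "pol": {"m1": "+", "m2": "+"}}
large = {"states": {"t0", "t1", "t2", "t3"}, "init": "t0",
         "final": {"t2", "t3"},
         "trans": {("t0", "m1", "t1"), ("t1", "m2", "t2"),
                   ("t1", "m3", "t3")},
         "pol": {"m1": "+", "m2": "+", "m3": "+"}}

print(simulated_by(small, large))  # small is simulated by large
print(simulated_by(large, small))  # the m3 branch has no counterpart
```

Two protocols are similar in the sense of the definition when `simulated_by` holds in both directions.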

The following lemma (see footnote 4) states that the simulation relation allows
protocols to be compared with respect to their complete execution trees.

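Lemma 1. Let $\mathcal{P}_1 = (\mathcal{S}^1, s_0^1, \mathcal{F}^1, \mathbf{M}^1, \mathcal{R}^1)$ and $\mathcal{P}_2 = (\mathcal{S}^2, s_0^2, \mathcal{F}^2, \mathbf{M}^2, \mathcal{R}^2)$ be two
protocols.

(i) $\mathcal{P}_1 \preceq \mathcal{P}_2$ iff there exists a node labeling function $\lambda_2 : \tau \to \mathcal{S}^2$ such that for
each complete execution tree $T = (\tau, \lambda, \delta)$ of $\mathcal{P}_1$, $T' = (\tau, \lambda_2, \delta)$ is a complete
execution tree of $\mathcal{P}_2$ and $\mathit{Polarity}(\mathcal{P}_1, \delta(x, xn)) = \mathit{Polarity}(\mathcal{P}_2, \delta(x, xn))$ for
each edge $(x, xn)$ of $\tau$.

(ii) $\mathcal{P}_1 \cong \mathcal{P}_2$ iff $\mathcal{P}_1$ and $\mathcal{P}_2$ have exactly the same set of complete execution trees,
modulo the name of the states.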

2.4 Protocol Interactions 

In the previous subsections we focused on representing a protocol supported by a
given service and comparing two service protocols using the simulation relation.
We now address the joint analysis of two protocols, that of a requestor and that
of a provider, to see if interactions between them are compatible. By defining the
constraints on the ordering of the messages that a web service accepts, a business
protocol makes explicit to clients how they can correctly interact with the service
(i.e., without generating errors due to incorrect sequencing of messages) [13, 3]
(see footnote 5). For example, a service that supports the reversed protocol
$\overline{\mathcal{P}_1}$ obtained from $\mathcal{P}_1$ of Figure 1(a) by reversing the direction of the
messages (i.e., input messages become outputs and vice versa) can interact
correctly with the store service. Interactions between two given protocols can
also be characterized in terms of execution paths and trees.

4 It should be noted that lemma proofs are not presented due to space reasons.

As an example, consider again the protocol $\mathcal{P}_1$ depicted in Figure 1(a) and
its reversed protocol $\overline{\mathcal{P}_1}$. As the two protocols have exactly the same states, if
$s$ is a state in the protocol $\mathcal{P}_1$, we use $\overline{s}$ to denote the corresponding state in
the protocol $\overline{\mathcal{P}_1}$. The path $(\mathit{Start}, \overline{\mathit{Start}}).\mathit{login}.(\mathit{Logged}, \overline{\mathit{Logged}})$ corresponds
to a possible interaction between protocols $\mathcal{P}_1$ and $\overline{\mathcal{P}_1}$. This path indicates that,
at the beginning, the two protocols $\mathcal{P}_1$ and $\overline{\mathcal{P}_1}$ are respectively at the states
$\mathit{Start}$ and $\overline{\mathit{Start}}$. Then, protocol $\overline{\mathcal{P}_1}$ sends message login and goes to state
$\overline{\mathit{Logged}}$ while protocol $\mathcal{P}_1$ receives message login and goes to state $\mathit{Logged}$.
The path $(\mathit{Start}, \overline{\mathit{Start}}).\mathit{login}.(\mathit{Logged}, \overline{\mathit{Logged}})$ is called an interaction path
of protocols $\mathcal{P}_1$ and $\overline{\mathcal{P}_1}$. Each state in this interaction path consists of a state
of $\mathcal{P}_1$ together with a state of $\overline{\mathcal{P}_1}$. The transition login indicates that an input
login message of one of the protocols coincides with an output login message
of the other protocol. Consequently, the polarity of the messages that appear in
an interaction path is not defined.
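The lockstep reading of an interaction path can be sketched in a few lines. This is an illustrative sketch only: it assumes deterministic protocols stored as a dictionary from (state, message) to the next state, uses the labels "+" and "-" for the two polarities, and the one-transition `store` protocol is a hypothetical fragment rather than the full protocol of Figure 1(a). Because the reversed protocol keeps the same state names, the position inside each pair plays the role of the overline.

```python
def reverse(p):
    """Reversed protocol: same structure, every polarity flipped."""
    flipped = {m: "-" if pol == "+" else "+" for m, pol in p["pol"].items()}
    return {**p, "pol": flipped}

def interaction_path(p1, p2, messages):
    """Step both protocols in lockstep; return the paired states or None."""
    s, q = p1["init"], p2["init"]
    path = [(s, q)]
    for m in messages:
        n1 = p1["trans"].get((s, m))
        n2 = p2["trans"].get((q, m))
        # Both sides need a transition on m, with opposite polarities.
        if n1 is None or n2 is None or p1["pol"][m] == p2["pol"][m]:
            return None
        s, q = n1, n2
        path.append((s, q))
    return path

# Hypothetical fragment of a store protocol with a single login step.
store = {"init": "Start",
         "trans": {("Start", "login"): "Logged"},
         "pol": {"login": "+"}}

print(interaction_path(store, reverse(store), ["login"]))
print(interaction_path(store, store, ["login"]))  # same polarity on both sides
```

The second call returns `None` because two copies of the same protocol agree on the polarity of login, so neither side can act as the counterpart of the other.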

Correct interactions between two protocols are captured by using the notion
of complete interaction trees, i.e., interaction trees in which both protocols start
at an initial state and end at a final state. For example, the complete interaction
tree of Figure 4(a) describes a possible correct interaction between the protocol
$\mathcal{P}_1$ of Figure 1(a) and its reversed protocol $\overline{\mathcal{P}_1}$. The notion of interaction tree is
formally defined below.

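Definition 4. (Interaction tree) An interaction tree between two protocols
$\mathcal{P}_1 = (\mathcal{S}^1, s_0^1, \mathcal{F}^1, \mathbf{M}^1, \mathcal{R}^1)$ and $\mathcal{P}_2 = (\mathcal{S}^2, s_0^2, \mathcal{F}^2, \mathbf{M}^2, \mathcal{R}^2)$ is a $((\mathcal{S}^1 \times \mathcal{S}^2), (\mathbf{M}^1 \cap \mathbf{M}^2))$-labeled
tree $I = (\tau, \lambda, \delta)$ such that:

- $\lambda(\epsilon) = (s_0^1, s_0^2)$, and

- For $x \in \tau$, $\lambda(x) = (s_x^1, s_x^2)$ such that $s_x^1 \in \mathcal{S}^1$ and $s_x^2 \in \mathcal{S}^2$. Then, for each
edge $(x, xn)$ of $\tau$, we have: $\mathcal{R}^1(s_x^1, s_{xn}^1, \delta(x, xn))$ and $\mathcal{R}^2(s_x^2, s_{xn}^2, \delta(x, xn))$,
and $\mathit{Polarity}(\mathcal{P}_1, \delta(x, xn)) \neq \mathit{Polarity}(\mathcal{P}_2, \delta(x, xn))$.

An interaction tree $I = (\tau, \lambda, \delta)$ is a complete interaction tree of the protocols
$\mathcal{P}_1$ and $\mathcal{P}_2$ if for every leaf $x \in \tau$ we have $\lambda(x) \in \mathcal{F}^1 \times \mathcal{F}^2$.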

In the sequel, an interaction between two protocols is characterized by the 
set of the complete interaction trees of these protocols. 

It should be noted that the notions of simulation and interactions defined
above focus on comparing protocols based on their structure and their messages,
regardless of how states are named. Specifically, when in the formal definition we
place conditions on the two protocols having the same message (with the same or
opposite polarity), we mean that they have to refer to the same WSDL message,
as defined by its fully qualified name. Naming of the states is instead irrelevant,
as it has no effect on identifying the conversations allowed by a protocol.

5 Recall that structural and semantics interoperability [6] are outside the scope of this
paper.

(a) An interaction tree I of the protocols $\mathcal{P}_1$ and $\overline{\mathcal{P}_1}$ (b) An interaction tree between $\mathcal{P}_1$ and $\mathcal{P}_2$

Fig. 4. Interaction trees.

3 Protocol Management Operators 

To assess commonalities and differences between protocols, as well as whether
two protocols can interact with each other, we define a set of generic operators
to manipulate business protocols, namely: compatible composition of protocols,
intersection of protocols, difference between protocols, and projection of a protocol
on a given role. The proposed operators take protocols as their operands and
return a protocol as their result. Although the proposed operators are generic
in the sense that they can be useful in several tasks related to management
and analysis of business protocols, we will show in the next section how these
operators can be used for analysing protocol compatibility and replaceability.
Efficient algorithms that implement the proposed operators as well as
correctness proofs are given in [4].

3.1 Compatible Composition 

The operator compatible composition characterizes the possible interactions
between two protocols, that of a requestor and that of a provider (i.e., the
resulting protocol describes all the interaction trees between the considered protocols,
and therefore characterizes the possible conversations that can take place
between the requestor and the provider). This operator, denoted as $\|^{C}$, takes as
input two business protocols and returns a protocol, called a compatible
composition protocol, that describes the set of complete interaction trees between the
input protocols. Informally, the initial state of the resulting protocol is obtained
by combining the initial states of the input protocols, final states are obtained
by combining the final states of the input protocols, while intermediate states
are constructed by combining the intermediate states of the input protocols. The
resulting protocol is constructed by considering messages of the two input
protocols which have the same names but opposite polarities, and that allow execution
paths to flow from the start state to end states of the new protocol. All the
states that are not reachable from the initial state of the resulting protocol as
well as the states that cannot lead to a final state are removed from the
resulting protocol. If the result of a compatible composition of two protocols is empty,
this means that no conversation is possible between two services that support
these protocols. Otherwise, the result is the identification of possible interactions
between these protocols.

Fig. 5. Examples of protocol management operators.

As an example, Figure 5(a) shows protocol $\mathcal{P}_1 \|^{C} \mathcal{P}_2$ that describes all the
possible complete interaction trees between protocols $\mathcal{P}_1$ of Figure 1(a) and $\mathcal{P}_2$
of Figure 1(b).

Definition 5. (Compatible composition)

Let $\mathcal{P}_1 = (\mathcal{S}^1, s_0^1, \mathcal{F}^1, \mathbf{M}^1, \mathcal{R}^1)$ and $\mathcal{P}_2 = (\mathcal{S}^2, s_0^2, \mathcal{F}^2, \mathbf{M}^2, \mathcal{R}^2)$ be two protocols.

The compatible composition $\mathcal{P} = \mathcal{P}_1 \|^{C} \mathcal{P}_2$ is a protocol $(\mathcal{S}, s_0, \mathcal{F}, \mathbf{M}, \mathcal{R})$ where:

- $\mathcal{S} \subseteq \mathcal{S}^1 \times \mathcal{S}^2$ is a finite set of states,

- $s_0 = (s_0^1, s_0^2)$ is the initial state,

- $\mathcal{F} \subseteq \mathcal{F}^1 \times \mathcal{F}^2$ is a set of final states,

- $\mathbf{M} \subseteq \mathbf{M}^1 \cap \mathbf{M}^2$ is a set of messages. Note that the polarity function is not
defined for the messages in a compatible composition protocol.

- $\mathcal{R}((s^1, s^2), (q^1, q^2), m)$ iff $\mathcal{R}^1(s^1, q^1, m)$ and $\mathcal{R}^2(s^2, q^2, m)$ and
$\mathit{Polarity}(\mathcal{P}_1, m) \neq \mathit{Polarity}(\mathcal{P}_2, m)$.

- $\forall (s^1, s^2) \in \mathcal{S}^1 \times \mathcal{S}^2$, the state $(s^1, s^2) \in \mathcal{S}$ iff $(s^1, s^2)$ belongs to a complete
execution path of $\mathcal{P}$ (i.e., a path that goes from the initial state $(s_0^1, s_0^2)$ to a
final state $(s_i^1, s_j^2) \in \mathcal{F}$).
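The construction in Definition 5 can be read as a product automaton followed by pruning. The sketch below is one straightforward way to compute it, not the efficient algorithm referenced in [4]: it explores pairs of states forward from the initial pair, matching transitions on the same message with opposite polarities, and then keeps only the states that can still reach a final pair. The dictionary layout and the `store` and `client` protocols are hypothetical simplifications introduced for illustration.

```python
def compatible_composition(p1, p2):
    """Product over shared messages with opposite polarities, then pruning."""
    init = (p1["init"], p2["init"])
    trans, seen, todo = set(), {init}, [init]
    # Forward exploration from the initial pair of states.
    while todo:
        s, q = todo.pop()
        for a, m, a2 in p1["trans"]:
            for b, m2, b2 in p2["trans"]:
                if ((a, b) == (s, q) and m == m2
                        and p1["pol"][m] != p2["pol"][m]):
                    trans.add(((s, q), m, (a2, b2)))
                    if (a2, b2) not in seen:
                        seen.add((a2, b2))
                        todo.append((a2, b2))
    final = {(s, q) for s, q in seen
             if s in p1["final"] and q in p2["final"]}
    # Backward pruning: keep only states that can still reach a final state.
    alive = set(final)
    changed = True
    while changed:
        changed = False
        for src, _, dst in trans:
            if dst in alive and src not in alive:
                alive.add(src)
                changed = True
    trans = {(src, m, dst) for src, m, dst in trans
             if src in alive and dst in alive}
    return {"init": init, "final": final, "states": alive, "trans": trans}

# Hypothetical store and client protocols (not those of Figure 1).
store = {"init": "Start", "final": {"Done"},
         "trans": {("Start", "login", "Logged"), ("Logged", "order", "Done")},
         "pol": {"login": "+", "order": "+"}}
client = {"init": "C0", "final": {"C2", "C4"},
          "trans": {("C0", "login", "C1"), ("C1", "order", "C2"),
                    ("C1", "compare", "C3"), ("C3", "price", "C4")},
          "pol": {"login": "-", "order": "-", "compare": "-", "price": "+"}}

comp = compatible_composition(store, client)
print(sorted(comp["states"]))
print(compatible_composition(store, store)["final"])  # empty: no conversation
```

In this toy case the client's compare/price branch disappears from the result, since the store offers no matching transitions for it.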



3.2 Intersection 

The intersection operator allows the computation of the largest common part
between two protocols. The intersection operator, denoted as $\|^{I}$, takes as input
two business protocols and returns a protocol that describes the set of complete
execution trees that are common between the two input protocols. The
resulting protocol is called an intersection protocol. This operator combines the two
input protocols as follows: states of the resulting protocol are constructed using
the same procedure as in the compatible composition operator. However, unlike
compatible composition, the intersection protocol is constructed by considering
messages of the input protocols which have the same names and polarities.

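Definition 6. (Intersection)

Let $\mathcal{P}_1 = (\mathcal{S}^1, s_0^1, \mathcal{F}^1, \mathbf{M}^1, \mathcal{R}^1)$ and $\mathcal{P}_2 = (\mathcal{S}^2, s_0^2, \mathcal{F}^2, \mathbf{M}^2, \mathcal{R}^2)$ be two protocols.
The intersection $\mathcal{P} = \mathcal{P}_1 \|^{I} \mathcal{P}_2$ is a protocol $(\mathcal{S}, s_0, \mathcal{F}, \mathbf{M}, \mathcal{R})$ where:

- $\mathcal{S} \subseteq \mathcal{S}^1 \times \mathcal{S}^2$,

- $s_0 = (s_0^1, s_0^2)$,

- $\mathcal{F} \subseteq \mathcal{F}^1 \times \mathcal{F}^2$,

- $\mathbf{M} \subseteq \mathbf{M}^1 \cap \mathbf{M}^2$.

- $\mathcal{R}((s, q), (s', q'), m)$ iff $\mathcal{R}^1(s, s', m)$ and $\mathcal{R}^2(q, q', m)$ and $\mathit{Polarity}(\mathcal{P}_1, m) =
\mathit{Polarity}(\mathcal{P}_2, m)$.

- $\forall (s^1, s^2) \in \mathcal{S}^1 \times \mathcal{S}^2$, the state $(s^1, s^2) \in \mathcal{S}$ iff $(s^1, s^2)$ belongs to a complete
execution path of $\mathcal{P}$ (i.e., a path that goes from the initial state $(s_0^1, s_0^2)$ to a
final state $(s_i^1, s_j^2) \in \mathcal{F}$).

Note that the intersection protocol preserves the polarity of the messages
(i.e., $\forall m \in \mathbf{M}$, $\mathit{Polarity}(\mathcal{P}_1 \|^{I} \mathcal{P}_2, m) = \mathit{Polarity}(\mathcal{P}_1, m) = \mathit{Polarity}(\mathcal{P}_2, m)$).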



3.3 Difference 

While the intersection identifies common aspects between two protocols, the
difference operator, denoted as $\|^{D}$, emphasizes their differences. This operator takes
as input two protocols $\mathcal{P}_1$ and $\mathcal{P}_2$, and returns a protocol called difference
protocol, whose purpose is to describe the set of all complete execution trees of $\mathcal{P}_1$ that
are not common with $\mathcal{P}_2$. As shown below, we compute the difference as a
protocol where states are combinations of states of $\mathcal{P}_1$ and $\mathcal{P}_2$, as opposed to deriving
the subset of $\mathcal{P}_1$ that is not part of $\mathcal{P}_2$. This will allow us to reuse procedures
similar to those developed for computing results of previous operators.

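Definition 7. (Difference)

Let $\mathcal{P}_1 = (\mathcal{S}^1, s_0^1, \mathcal{F}^1, \mathbf{M}^1, \mathcal{R}^1)$ and $\mathcal{P}_2 = (\mathcal{S}^2, s_0^2, \mathcal{F}^2, \mathbf{M}^2, \mathcal{R}^2)$ be two protocols
and let $\gamma \notin \mathcal{S}^1 \cup \mathcal{S}^2$ be a new state name. The difference $\mathcal{P} = \mathcal{P}_1 \|^{D} \mathcal{P}_2$ is a protocol
$(\mathcal{S}, s_0, \mathcal{F}, \mathbf{M}, \mathcal{R})$ where:

- $\mathcal{S} \subseteq \mathcal{S}^1 \times (\mathcal{S}^2 \cup \{\gamma\})$,

- $s_0 = (s_0^1, s_0^2)$,

- $\mathcal{F} \subseteq \mathcal{F}^1 \times ((\mathcal{S}^2 \cup \{\gamma\}) \setminus \mathcal{F}^2)$,

- $\mathbf{M} \subseteq \mathbf{M}^1$.

- $\mathcal{R}((s, q), (s', q'), m)$, with $q, q' \in \mathcal{S}^2$, iff $\mathcal{R}^1(s, s', m)$, $\mathcal{R}^2(q, q', m)$ and
$\mathit{Polarity}(\mathcal{P}_1, m) = \mathit{Polarity}(\mathcal{P}_2, m)$,

- $\mathcal{R}((s, q), (s', \gamma), m)$, with $q \in \mathcal{S}^2$, iff $\mathcal{R}^1(s, s', m)$ and there exists no $q' \in \mathcal{S}^2$ such
that $\mathcal{R}^2(q, q', m)$ and $\mathit{Polarity}(\mathcal{P}_1, m) = \mathit{Polarity}(\mathcal{P}_2, m)$,

- $\mathcal{R}((s, \gamma), (s', \gamma), m)$ iff $\mathcal{R}^1(s, s', m)$.

- $\forall (s^1, s^2) \in \mathcal{S}^1 \times \mathcal{S}^2$, the state $(s^1, s^2) \in \mathcal{S}$ iff $(s^1, s^2)$ belongs to a complete
execution path of $\mathcal{P}$ (i.e., a path that goes from the initial state $(s_0^1, s_0^2)$ to a
final state $(s_i^1, s_j^2) \in \mathcal{F}$).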

As an example, Figure 5(b) shows the difference protocol $\mathcal{P}_4 \|^{D} \mathcal{P}_3$ that
describes all the complete execution trees of the protocol $\mathcal{P}_4$ of Figure 6(b) that are
not allowed by the protocol $\mathcal{P}_3$ of Figure 6(a). From this, it can be inferred, e.g.,
that the sequence of messages login(+).selectGoods(+).POrder(+).cancelPO(+),
which is derived from a complete execution path of the difference protocol, is a
conversation which is allowed by the protocol $\mathcal{P}_4$ but is not allowed by $\mathcal{P}_3$.
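The role of the extra state in the difference construction is easiest to see on a small case. The sketch below follows the shape of Definition 7 under simplifying assumptions: the second protocol is taken to be deterministic, the fresh state is represented by the hypothetical constant `GAMMA`, and the two protocols are toy versions with and without a cancel step rather than the protocols of Figure 6. It is meant as an illustration, not as the algorithm of [4].

```python
GAMMA = "gamma"  # hypothetical name for the fresh state outside p2

def difference(p1, p2):
    """Behaviour of p1 that p2 cannot follow, as a product protocol."""
    init = (p1["init"], p2["init"])
    trans, seen, todo = set(), {init}, [init]
    while todo:
        s, q = todo.pop()
        for a, m, a2 in p1["trans"]:
            if a != s:
                continue
            # States p2 can move to on m with the same polarity.
            targets = [b2 for b, m2, b2 in p2["trans"]
                       if b == q and m2 == m
                       and p2["pol"][m] == p1["pol"][m]]
            # If p2 cannot follow (or was already left behind), go to GAMMA.
            for q2 in targets or [GAMMA]:
                trans.add(((s, q), m, (a2, q2)))
                if (a2, q2) not in seen:
                    seen.add((a2, q2))
                    todo.append((a2, q2))
    # Final: p1 has finished while p2 is not in one of its final states.
    final = {(s, q) for s, q in seen
             if s in p1["final"] and q not in p2["final"]}
    alive = set(final)
    changed = True
    while changed:
        changed = False
        for src, _, dst in trans:
            if dst in alive and src not in alive:
                alive.add(src)
                changed = True
    trans = {(src, m, dst) for src, m, dst in trans
             if src in alive and dst in alive}
    return {"init": init, "final": final, "states": alive, "trans": trans}

# Hypothetical protocols: `with_cancel` allows cancelling, `plain` does not.
plain = {"init": "Start", "final": {"Paid"},
         "trans": {("Start", "login", "Logged"), ("Logged", "order", "Ordering"),
                   ("Ordering", "pay", "Paid")},
         "pol": {"login": "+", "order": "+", "pay": "+"}}
with_cancel = {"init": "Start", "final": {"Paid", "Cancelled"},
               "trans": plain["trans"] | {("Ordering", "cancel", "Cancelled")},
               "pol": {**plain["pol"], "cancel": "+"}}

diff = difference(with_cancel, plain)
print(sorted(m for _, m, _ in diff["trans"]))
```

Only the path login, order, cancel survives the pruning, which mirrors how the cancelPO conversation is read off the difference protocol in the example above.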

3.4 Projection 

In this section, we discuss the projection of a protocol obtained by using one of
the previous operators (i.e., compatible composition, intersection, or difference)
on a participant protocol. In the case of compatible composition, the projection
of $\mathcal{P}_1 \|^{C} \mathcal{P}_2$ on the protocol $\mathcal{P}_1$, denoted as $[\mathcal{P}_1 \|^{C} \mathcal{P}_2]_{\mathcal{P}_1}$, identifies the part
of the protocol $\mathcal{P}_1$ that is able to interact correctly with the protocol $\mathcal{P}_2$. While
a compatible composition protocol $\mathcal{P}$ characterizes the possible
interactions between two business protocols, each defining the behavior of a service
playing a certain role in a collaboration, the projection of $\mathcal{P}$ allows the
extraction of a protocol that defines the role a service plays in a collaboration (e.g., a
customer) defined by $\mathcal{P}$. This is very important since the expected behavior of a
service in a collaboration constitutes an important part of the requirements for
the implementation.

As an example, Figure 5(c) shows the projection of protocol $\mathcal{P}_1 \|^{C} \mathcal{P}_2$ of
Figure 5(a) on protocol $\mathcal{P}_2$. The obtained protocol describes the part of protocol $\mathcal{P}_2$
that can be used to interact correctly with $\mathcal{P}_1$ (i.e., the role of $\mathcal{P}_2$ in $\mathcal{P}_1 \|^{C} \mathcal{P}_2$).

Briefly stated, the projection of a protocol obtained using an intersection
or a difference is defined as follows. In the case of intersection, the projection
$[\mathcal{P}_1 \|^{I} \mathcal{P}_2]_{\mathcal{P}_1}$ identifies the part of the protocol $\mathcal{P}_1$ that is common with
the protocol $\mathcal{P}_2$. In the case of difference, the projection $[\mathcal{P}_1 \|^{D} \mathcal{P}_2]_{\mathcal{P}_1}$
identifies the part of the protocol $\mathcal{P}_1$ that is not supported by the protocol $\mathcal{P}_2$.
Below, we give the formal definition of the projection of a protocol obtained by
using compatible composition, intersection, or difference.

(a) A protocol $\mathcal{P}_3$ (b) A protocol $\mathcal{P}_4$ (c) A protocol $\mathcal{P}_5$

Fig. 6. Business protocols of two other store services.

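Definition 8. (Projection) Let $\mathcal{P} = (\mathcal{S}, s_0, \mathcal{F}, \mathbf{M}, \mathcal{R})$ be a protocol obtained using
compatible composition, intersection, or difference of two protocols $\mathcal{P}_1$ and $\mathcal{P}_2$
(e.g., $\mathcal{P} = \mathcal{P}_1 \|^{C} \mathcal{P}_2$). A projection of $\mathcal{P}$ on the protocol $\mathcal{P}_1$, denoted as $[\mathcal{P}]_{\mathcal{P}_1}$,
is a protocol $(\mathcal{S}', s'_0, \mathcal{F}', \mathbf{M}, \mathcal{R})$ obtained from $\mathcal{P}$ by projecting the states of $\mathcal{P}$
on $\mathcal{P}_1$ (i.e., replacing each state $(s_i^1, s_j^2) \in \mathcal{S}$ by the state $s_i^1$ in $\mathcal{S}'$) and by
defining the polarity function of the messages as follows: $\mathit{Polarity}([\mathcal{P}]_{\mathcal{P}_1}, m) =
\mathit{Polarity}(\mathcal{P}_1, m), \forall m \in \mathbf{M}$.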
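Projection amounts to keeping one component of every paired state and taking polarities from the operand being projected on. The sketch below illustrates this on a hand-written product protocol. The dictionary layout, the `side` index (0 for the first operand, 1 for the second) and the sample data are assumptions made for illustration and are not taken from the figures.

```python
def project(prod, side, operand):
    """Project a product protocol on its first (0) or second (1) operand."""
    trans = {(a[side], m, b[side]) for a, m, b in prod["trans"]}
    return {
        "init": prod["init"][side],
        "final": {s[side] for s in prod["final"]},
        "trans": trans,
        # Polarities come from the operand we project on.
        "pol": {m: operand["pol"][m] for _, m, _ in trans},
    }

# Hypothetical compatible composition of a store and a client protocol.
composed = {"init": ("Start", "C0"), "final": {("Done", "C2")},
            "trans": {(("Start", "C0"), "login", ("Logged", "C1")),
                      (("Logged", "C1"), "order", ("Done", "C2"))}}
# Only the polarity function of the client operand is needed here.
client = {"pol": {"login": "-", "order": "-", "compare": "-"}}

role = project(composed, 1, client)
print(sorted(role["trans"]))
print(role["pol"])
```

The resulting `role` no longer mentions the compare message: it describes only the part of the client that takes part in the composed interaction.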

4 Taxonomy of Protocols Compatibility 
and Replaceability 

This section analyzes service protocols compatibility and replaceability. Service
compatibility refers to capabilities of services to interoperate while service
replaceability refers to the ability of a given service to be used instead of another
service, in such a way that the change is transparent to external clients. We
define and characterize several types of protocols compatibility (respectively,
replaceability). We show how, given two services and the corresponding protocols,
it is possible to identify their compatibility (respectively, replaceability) levels
using the operators introduced in Section 3. Instead of simple black and white
compatibility and replaceability measures (i.e., whether two services are
compatible or not, whether a service can replace another or not), we propose to consider
different classes of protocols compatibility and replaceability.



4.1 Compatibility Classes 

We identify two classes of protocols compatibility which provide basic building
blocks for analysing complex interactions between service protocols.

- Partial compatibility (or simply, compatibility): a protocol $\mathcal{P}_1$ is partially
compatible with another protocol $\mathcal{P}_2$ if there are some executions of $\mathcal{P}_1$ that
can interoperate with $\mathcal{P}_2$, i.e., if there is at least one possible conversation
that can take place between two services supporting these protocols.

- Full compatibility: a protocol $\mathcal{P}_1$ is fully compatible with another protocol
$\mathcal{P}_2$ if all the executions of $\mathcal{P}_1$ can interoperate with $\mathcal{P}_2$, i.e., any conversation
that can be generated by $\mathcal{P}_1$ is understood by $\mathcal{P}_2$.

These notions of compatibility are very useful in the context of Web services.
For example, it does not make sense to have interactions with services for which
there is no (partial or total) compatibility, as no meaningful conversation can be
carried on. Furthermore, if there is only partial compatibility, the developer and
the Web service middleware need to be aware of this, as the service will not be
able to exploit its full capabilities when interacting with the partially compatible
one: indeed, in this case, it is not sufficient that a service implementation is
compliant with its advertised protocol, as additional constraints are posed by
the fact that the service is interacting with another one whose protocol is only
partially compatible, and hence some conversations are disallowed.

As an example, protocol $\mathcal{P}_1$ of Figure 1(a) can interact with its reversed
protocol $\overline{\mathcal{P}_1}$ without generating errors and, hence, $\mathcal{P}_1$ is fully compatible with $\overline{\mathcal{P}_1}$.
However, this is not the case for the protocol $\mathcal{P}_1$ of Figure 1(a) and the protocol
$\mathcal{P}_2$ of Figure 1(b). When the protocol $\mathcal{P}_2$ is at the state Selecting, it can send a
message comparePrice, e.g., to look for the best price of a given product, and goes
into a state Price Processing where it waits for an input message priceInfo. These
two transitions do not coincide with the transitions of the protocol $\mathcal{P}_1$ (i.e.,
the protocol $\mathcal{P}_1$ does not accept an input message comparePrice at the state
Selecting, nor is it able to generate an output message priceInfo). Clearly, there
are some executions of the protocol $\mathcal{P}_2$ that cannot interact with the protocol $\mathcal{P}_1$.
However, we can observe that there are some cases where the protocol $\mathcal{P}_2$ is able
to interact correctly with the protocol $\mathcal{P}_1$ (i.e., there are some executions of $\mathcal{P}_2$
that are compatible with executions of the protocol $\mathcal{P}_1$). An example of such an
interaction is given by the complete interaction tree between $\mathcal{P}_1$ and $\mathcal{P}_2$ depicted
in Figure 4(b). Hence, the protocols $\mathcal{P}_1$ and $\mathcal{P}_2$ are (partially) compatible.

We use the boolean operator $\textit{P-compat}(\mathcal{P}_1, \mathcal{P}_2)$ (respectively, $\textit{F-compat}(\mathcal{P}_1, \mathcal{P}_2)$)
to test if the protocol $\mathcal{P}_1$ is partially compatible (respectively, fully
compatible) with the protocol $\mathcal{P}_2$. The following lemma gives necessary and sufficient
conditions to identify the compatibility level between two protocols.

Lemma 2. Let $\mathcal{P}_1$ and $\mathcal{P}_2$ be two business protocols, then

(i) $\textit{P-compat}(\mathcal{P}_1, \mathcal{P}_2)$ iff $\mathcal{P}_1 \|^{C} \mathcal{P}_2$ is not an empty protocol (i.e., the set of its
final states is not empty).

(ii) $\textit{F-compat}(\mathcal{P}_1, \mathcal{P}_2)$ iff $[\mathcal{P}_1 \|^{C} \mathcal{P}_2]_{\mathcal{P}_1} \cong \mathcal{P}_1$.

Note that full compatibility is not a symmetric relation, i.e., $\textit{F-compat}(\mathcal{P}_1, \mathcal{P}_2)$
does not imply $\textit{F-compat}(\mathcal{P}_2, \mathcal{P}_1)$ (however, $\textit{F-compat}(\mathcal{P}_1, \mathcal{P}_2)$ implies
$\textit{P-compat}(\mathcal{P}_2, \mathcal{P}_1)$).
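Item (i) of the lemma suggests a direct test for partial compatibility: the compatible composition is non-empty exactly when some pair of final states is reachable from the pair of initial states through matching transitions. The sketch below checks that by a plain graph search, without building the composed protocol. The dictionary layout and the three toy protocols are hypothetical; full compatibility would additionally require the projection and a similarity check, which are not shown here.

```python
def p_compat(p1, p2):
    """True if some pair of final states is reachable by matching messages."""
    init = (p1["init"], p2["init"])
    seen, todo = {init}, [init]
    while todo:
        s, q = todo.pop()
        if s in p1["final"] and q in p2["final"]:
            return True
        for a, m, a2 in p1["trans"]:
            for b, m2, b2 in p2["trans"]:
                # Same message, opposite polarities, pair not visited yet.
                if (a == s and b == q and m == m2
                        and p1["pol"][m] != p2["pol"][m]
                        and (a2, b2) not in seen):
                    seen.add((a2, b2))
                    todo.append((a2, b2))
    return False

# Hypothetical protocols used only to exercise the check.
store = {"init": "Start", "final": {"Done"},
         "trans": {("Start", "login", "Logged"), ("Logged", "order", "Done")},
         "pol": {"login": "+", "order": "+"}}
client = {"init": "C0", "final": {"C2"},
          "trans": {("C0", "login", "C1"), ("C1", "order", "C2"),
                    ("C1", "compare", "C3")},
          "pol": {"login": "-", "order": "-", "compare": "-"}}
stranger = {"init": "X0", "final": {"X1"},
            "trans": {("X0", "hello", "X1")},
            "pol": {"hello": "-"}}

print(p_compat(store, client))    # a complete conversation exists
print(p_compat(store, stranger))  # no shared message, no conversation
```

Because the search only asks whether a final pair is reachable, the client's extra compare branch does not prevent partial compatibility.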



4.2 Replaceability Classes 

Replaceability analysis helps us identify if we can use a service supporting a
certain protocol in place of a service supporting a different protocol, both in general
and when interacting with a certain client. It also helps developers to manage
service evolution, as when a service is modified there is the need for understanding if
it can still support all conversations the previous version supported. We identify
four replaceability classes between protocols, namely: equivalence, subsumption,
replaceability with respect to a client protocol and replaceability with respect
to an interaction role. These replaceability classes provide basic building blocks
for analysing the commonalities and differences between service protocols.

1. Protocols equivalence: two business protocols $\mathcal{P}_1$ and $\mathcal{P}_2$ are equivalent if they
are mutually substitutable, i.e., the two protocols can be used interchangeably
in any context and the change is transparent to clients. We use the boolean
operator $\mathit{Equiv}(\mathcal{P}_1, \mathcal{P}_2)$ to test the equivalence of protocols $\mathcal{P}_1$ and $\mathcal{P}_2$.

2. Protocol subsumption: a protocol $\mathcal{P}_1$ is subsumed by another protocol $\mathcal{P}_2$
if the externally visible behavior of $\mathcal{P}_1$ encompasses the externally visible
behavior of $\mathcal{P}_2$, i.e., if $\mathcal{P}_1$ supports at least all the conversations that $\mathcal{P}_2$
supports. In this case, protocol $\mathcal{P}_1$ can be transparently used instead of
$\mathcal{P}_2$ but the opposite is not necessarily true. We use the boolean operator
$\mathit{Subs}(\mathcal{P}_1, \mathcal{P}_2)$ to test if $\mathcal{P}_2$ subsumes $\mathcal{P}_1$. It should be noted that equivalence
is stronger than subsumption (i.e., $\mathit{Equiv}(\mathcal{P}_1, \mathcal{P}_2)$ implies $\mathit{Subs}(\mathcal{P}_1, \mathcal{P}_2)$ and
$\mathit{Subs}(\mathcal{P}_2, \mathcal{P}_1)$). The protocol $\mathcal{P}_1$ of Figure 1(a) is subsumed by the protocol
$\mathcal{P}_3$ of Figure 6(a).

3. Protocol replaceability with respect to a client protocol: The previous
definitions discussed replaceability in general. However, it may be important to
understand if a service can be used to replace another one when interacting
with a certain client. This leads to a weaker definition of replaceability: a
protocol $\mathcal{P}_1$ can replace another protocol $\mathcal{P}_2$ with respect to a client protocol
$\mathcal{P}_C$, denoted as $\mathit{Repl}_{[\mathcal{P}_C]}(\mathcal{P}_1, \mathcal{P}_2)$, if $\mathcal{P}_1$ behaves similarly to $\mathcal{P}_2$ when
interacting with a specific protocol $\mathcal{P}_C$. Hence, if $\mathit{Repl}_{[\mathcal{P}_C]}(\mathcal{P}_1, \mathcal{P}_2)$ then $\mathcal{P}_1$ can
replace $\mathcal{P}_2$ to interact with $\mathcal{P}_C$ and this change is transparent to the client
$\mathcal{P}_C$. It should be noted that this replaceability class is also weaker than
subsumption (i.e., $\mathit{Subs}(\mathcal{P}_1, \mathcal{P}_2)$ implies $\mathit{Repl}_{[\mathcal{P}_C]}(\mathcal{P}_1, \mathcal{P}_2)$ for any protocol $\mathcal{P}_C$).
For example, protocol $\mathcal{P}_1$ of Figure 1(a) is not subsumed by protocol $\mathcal{P}_4$ of
Figure 6(b), as $\mathcal{P}_4$ allows a client to cancel an order (message cancelPO(+) at
state Ordering) and also accepts payments by credit card (message
CCpayment(-) at state Invoicing) while $\mathcal{P}_1$ does not support these two
possibilities. Hence, $\mathcal{P}_1$ cannot replace $\mathcal{P}_4$ for arbitrary clients. However, we can
observe that $\mathcal{P}_1$ can replace $\mathcal{P}_4$ when interacting with the protocol $\mathcal{P}_2$ of
Figure 1(b), as this client never cancels an order and always performs
its payments by bank transfer.

4. Protocol replaceability with respect to an interaction role: Let $\mathcal{P}_R$ be a
business protocol. A protocol $\mathcal{P}_1$ can replace another protocol $\mathcal{P}_2$ with respect
to a role $\mathcal{P}_R$, denoted as $\textit{Repl-Role}_{[\mathcal{P}_R]}(\mathcal{P}_1, \mathcal{P}_2)$, if $\mathcal{P}_1$ behaves similarly to
$\mathcal{P}_2$ when $\mathcal{P}_2$ behaves as $\mathcal{P}_R$. This replaceability class identifies
executions of a protocol $\mathcal{P}_2$ that can be replaced by the protocol $\mathcal{P}_1$ even in
the case when $\mathcal{P}_1$ and $\mathcal{P}_2$ are not comparable with respect to any of the
previous replaceability classes. The class $\textit{Repl-Role}_{[\mathcal{P}_R]}(\mathcal{P}_1, \mathcal{P}_2)$ is the weakest
replaceability class (i.e., $\mathit{Repl}_{[\mathcal{P}_R]}(\mathcal{P}_1, \mathcal{P}_2)$ implies $\textit{Repl-Role}_{[\mathcal{P}_R]}(\mathcal{P}_1, \mathcal{P}_2)$).
For example, consider the protocol $\mathcal{P}_2$ of Figure 1(b) and its reversed protocol
$\overline{\mathcal{P}_2}$. Protocol $\mathcal{P}_4$ of Figure 6(b) cannot replace protocol $\mathcal{P}_2$ when interacting
with client $\overline{\mathcal{P}_2}$ (i.e., $\mathit{Repl}_{[\overline{\mathcal{P}_2}]}(\mathcal{P}_4, \mathcal{P}_2)$ does not hold). This is because protocol
$\mathcal{P}_4$ does not accept an input message comparePrice at the state Selecting.
However, even in this case, a client $\overline{\mathcal{P}_2}$ may be interested to know for which
executions it can use $\mathcal{P}_4$ instead of $\mathcal{P}_2$. For example, we can observe that $\mathcal{P}_4$
can replace $\mathcal{P}_2$ in all the interactions in which $\mathcal{P}_2$ behaves as the protocol
$\mathcal{P}_5$ of Figure 6(c) (i.e., $\textit{Repl-Role}_{[\mathcal{P}_5]}(\mathcal{P}_4, \mathcal{P}_2)$). In other words, the protocol
$\mathcal{P}_5$ exhibits to a given client the executions of $\mathcal{P}_2$ for which it is possible to use
$\mathcal{P}_4$ instead of $\mathcal{P}_2$.

The following lemma characterizes the replaceability levels of two given pro- 
tocols using the operators introduced in previous sections. 

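Lemma 3. Let $\mathcal{P}_1$, $\mathcal{P}_2$, $\mathcal{P}_C$ and $\mathcal{P}_R$ be business protocols.

1. $\mathit{Equiv}(\mathcal{P}_1, \mathcal{P}_2)$ iff $\mathcal{P}_1 \cong \mathcal{P}_2$

2. $\mathit{Subs}(\mathcal{P}_1, \mathcal{P}_2)$ iff $\mathcal{P}_2 \preceq \mathcal{P}_1$

3. $\mathit{Repl}_{[\mathcal{P}_C]}(\mathcal{P}_1, \mathcal{P}_2)$ iff $[\mathcal{P}_C \|^{C} \mathcal{P}_2]_{\mathcal{P}_2} \preceq \mathcal{P}_1$ (or equivalently, iff $\mathcal{P}_C \|^{C} [\mathcal{P}_2 \|^{D} \mathcal{P}_1]_{\mathcal{P}_2}$ is
an empty protocol)

4. $\textit{Repl-Role}_{[\mathcal{P}_R]}(\mathcal{P}_1, \mathcal{P}_2)$ iff $\mathcal{P}_R \preceq [\mathcal{P}_1 \|^{I} \mathcal{P}_2]_{\mathcal{P}_1}$.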

Note that this lemma provides two equivalent characterizations of class
$\mathit{Repl}_{[\mathcal{P}_C]}(\mathcal{P}_1, \mathcal{P}_2)$ (item 3 of the lemma). The second characterization (i.e.,
$\mathcal{P}_C \|^{C} [\mathcal{P}_1 \|^{D} \mathcal{P}_2]_{\mathcal{P}_1}$) can be useful to check whether $\mathcal{P}_2$ can be used instead of $\mathcal{P}_1$
with respect to a client $\mathcal{P}_C$ in those cases where the protocol $\mathcal{P}_2$ is not fully
accessible (e.g., $\mathcal{P}_2$ is hidden for security reasons). Furthermore, such a
characterization may be interesting for change support as it makes it possible to incrementally
check whether a given client protocol $\mathcal{P}_C$ used to interact with a protocol $\mathcal{P}_1$ can
still interact correctly with a new version $\mathcal{P}_2$ of the protocol $\mathcal{P}_1$.

5 Discussion 

We believe that the effective use and widespread adoption of Web service
technologies and standards requires: (i) high-level frameworks and methodologies for
supporting automated development and interoperability (e.g., code generation,
compatibility), and (ii) identification of appropriate abstractions and notations
for specifying service requirements and characteristics. Service protocol
management as proposed in this paper offers a set of mechanisms for the automation of
services development and interoperability.

Several efforts recognize aspects of protocol specifications in
component-based models [7, 13]. These efforts provide models (e.g., pi-calculus-based
languages for component interface specifications) and algorithms (e.g., compatibility
checking) that can be generalized for use in Web service protocol specifications
and management. Indeed, various efforts in the general area of formalizing Web
service description and composition languages emerged recently [5, 11]. However,
in terms of managing the Web service development life-cycle, technology is still
in the early stages. The main contribution of the work presented in this paper is
a framework that leverages emerging Web services technologies and established
modeling notation (state machine-based formalism) to provide high-level
support for analyzing degrees of commonalities and differences between protocols
as well as interoperation possibilities of interacting Web services. In the following
we briefly discuss how the framework presented in this paper can be leveraged
to better support the service lifecycle management.

Development-time support. During service development, protocol analysis can
assist in assessing the compatibility of the newly created service (and service
protocol) with the other services with which it needs to interact. The protocol
analysis will in particular help users to identify which parts of the protocol are
compatible and which are not, therefore suggesting possible areas of
modification that need to be tackled in order to increase the level of compatibility with
a desired service.

Runtime support. In terms of runtime support, the main application of compat- 
ibility analysis is in dynamic binding. In fact, just like for static binding, the 
benefit of protocol analysis is that search engines can restrict the services they 
return to those that are compatible. This is essential, as there is no point in 
returning services that are not protocol-compatible, since no interoperation will 
be possible (unless there is a mediation mechanism that can make interaction 
possible, as discussed below). 

Change support. Web services operate autonomously within potentially dynamic
environments. In particular, component or partner services may change their
protocols, others may become unavailable, and still others may emerge.
Consequently, services may fail to invoke required operations when needed. The
proposed operators make it possible to statically and dynamically identify alternative
services based on behavior equivalence and substitution. Protocol analysis and
management also provide opportunities to: (i) help understand mismatches
between protocols, (ii) help understand if a new version of a service protocol is
compatible with the intended clients, and the like.

The framework presented in this paper is one of the components of a broader
CASE tool, partially implemented, that manages the entire service development
lifecycle [2, 3, 10]. In this paper, we focused on analysing and managing Web
service protocols. Another component (presented in [2]) of this framework features
a generative approach where conversation protocols are specified using an
extended state machine model, composition models are specified using statecharts,
and executable processes are described using BPEL. Through this component,
users can visually edit service conversation and composition models and
automatically generate the BPEL skeletons, which can then be extended by the
developers and eventually executed using a BPEL execution engine such as
IBM's BPWS4J (www.alphaworks.ibm.com/tech/bpws4j). We are also
considering the extension of the proposed approach to cater for trust negotiation and
security protocols in Web services, by exploring the connections between
conversation and trust negotiation protocols [10], which can both be specified through state
machines, although at different levels of abstraction. The proposed framework
also supports lifecycle management of trust negotiation protocols [10]. We
introduced a set of change operations that are used to modify protocol specifications.
Strategies are presented to allow migration of ongoing negotiations to a new
protocol specification. Details about these features of the framework can be found
in [2, 10]. Finally, our current research also includes addressing the problem of
designing and testing for compatibility, trying in particular to understand how to
develop test cases that can test that two services can interact and how to devise
a methodology for service development that facilitates protocol compatibility.

Abstract. A major issue in the study of semantic Web services concerns the
matching problem of Web services. Various techniques for this problem have 
been proposed. Typical ones include FSM modeling, DAML-S ontology match- 
ing, description logics reasoning, and WSDL dual operation composition. They 
often assume the availability of concept semantic relations, based on which the 
capability satisfiability is evaluated. However, we find that the use of semantic 
relations alone in the satisfiability evaluation may lead to inappropriate results. 
In this paper, we study the problem and classify the existing techniques of satis- 
fiability evaluation into three approaches, namely, set inclusion checking, con- 
cept coverage comparison and concept subsumption reasoning. Two different 
semantic interpretations, namely, capacity interpretation and restriction inter- 
pretation, are identified. However, each of the three approaches assumes only 
one interpretation and its evaluation is inapplicable to the other interpretation. 
To address this limitation, a novel interpretation model, called CRI model, is 
formulated. This model supports both semantic interpretations, and allows the 
satisfiability evaluation to be uniformly conducted. Finally, we present an algo- 
rithm for the unified satisfiability evaluation. 



1 Introduction 

Much attention has been paid to the study of semantic Web services [5], which aims 
to explore a better way to describe and implement Web services, making them acces- 
sible to automated agents. Ontology plays a key role in the Semantic Web by provid- 
ing common vocabularies shared by applications. Currently, the Web Services De- 
scription Language (WSDL) [19] does not support semantic descriptions for Web 
services [4], and the Universal Description, Discovery and Integration (UDDI) [20] 
also cannot provide adequate documentation of Web service capabilities [4]. To ad- 
dress their limitations, ontology languages (e.g., DAML+OIL [15]) have been pro- 
posed whose goal is to facilitate the automation of tasks including Web service dis- 
covery, execution, composition and interoperation. 

Capability matching of Web services is a major research issue in this field. Gener- 
ally, when an agent is given a service request, it tries to match it against available 
service advertisements stored in its Web service repository. The matching is con- 
ducted in terms of capabilities based on the underlying semantics of the concepts 
involved. A variety of techniques have been proposed. Popular ones include
finite-state machine (FSM) modeling [2], DAML-S [16] ontology matching [11], Description
Logics (DLs) [1] reasoning [4], and WSDL dual operation composition [6]. 

A service advertisement is said to match a service request only when the
capabilities of the advertisement satisfy the requirements of the request. The
matching process is called satisfiability evaluation. We classify existing techniques
for satisfiability evaluation into three approaches: 

• Set Inclusion Checking: In this approach, each set contains available resources or 
required resources. The goal of checking is to find inclusion relations between 
sets. For example, if $B_1$ = {“SOAP”} [18] and $B_2$ = {“SOAP”, “HTTP”}, $B_2$ is said 
to be able to satisfy $B_1$ [6]. 

• Concept Coverage Comparing: This approach uses concepts to represent
capabilities or requirements. In comparison, a more general concept can satisfy a more 
specific concept. For example, concept Vehicle satisfies concept Car [11]. 

• Concept Subsumption Reasoning: In this approach, capabilities or requirements 
are represented in DL clauses. In reasoning, a more specific DL clause can satisfy 
a more general DL clause. For example, $\forall item.(PC \sqcap {>}256\ memorySize)$ satisfies 
$\forall item.(PC \sqcap {>}128\ memorySize)$ [4]. 

The first two approaches are essentially equivalent. To illustrate this, we assume 
that Vehicle has two sub-concepts: Car and Bus. After we rewrite Vehicle as {Car, 
Bus} and Car as {Car}, Vehicle satisfies Car iff {Car, Bus} includes {Car}. Next, the 
last two approaches have opposite criteria for the satisfiability evaluation. The reason 
is that each DL clause can be regarded as a concept. Finally, all three approaches 
make use of semantic relations between concepts. 




Fig. 1. The Vehicle and PC Examples 



However, the use of concept semantic relations alone in the satisfiability evaluation 
may lead to inappropriate results, because a concept can be subject to multiple 
interpretations under different contexts. The Vehicle example actually adopts the 
capacity interpretation of Vehicle, which refers to the semantics of Vehicle and all its 
sub-concepts. Vehicle can satisfy Car under this context. But the PC example adopts 
restriction interpretations of two DL clauses: $\forall item.(PC \sqcap {>}256\ memorySize)$ and 
$\forall item.(PC \sqcap {>}128\ memorySize)$. Under this context, the former is more restrictive 
than the latter. Therefore, the requirement of a PC with memory larger than 128MB 
can be satisfied by the offering of a PC with memory larger than 256MB (Fig. 1). 

The above two examples indicate that an offering labeled by a more general concept
does not necessarily satisfy a requirement labeled by a more specific concept, 
and vice versa. The satisfiability evaluation should depend on how the involved
concepts are semantically interpreted. We call this semantic interpretation, which can be 
based on either capacity or restriction. More rigorous definitions of these
interpretations will be given later. 

Each of the three approaches discussed actually assumes only one semantic inter- 
pretation, and its satisfiability evaluation is inapplicable to the other semantic inter- 
pretation. To address this limitation, we propose a unified capability matching model 
for Web services, in which the satisfiability evaluation can be performed uniformly 
based on two semantic interpretations. 




544 Chang Xu, Shing-Chi Cheung, and Xiangye Xiao 



The remainder of this paper is organized as follows. Section 2 introduces related 
work. Section 3 proposes our solution to the Web service capability matching prob- 
lem. Section 4 discusses how to implement the capability matching system. Section 5 
evaluates our system using four well-known criteria, and presents our two extra fea- 
tures. In the last section, we conclude our contributions and explore future work. 



2 Related Work 

Various studies have been conducted on the service capability matching issue. Gao et 
al. [2] formally defined exact match and plug-in match for Web service capabilities 
based on abstract FSM models. They also proposed a capability description language 
SCDL and a solution for comparing service capabilities. However, the work focuses 
mainly on signature matching rather than semantic matching. Moreover, it is difficult 
to use the FSM approach to precisely describe Web service capabilities. 

Paolucci et al. [11] suggested semantic matching for Web services and proposed a 
solution based on DAML-S. The work presents a matching framework to examine 
every advertisement and request pair. It also suggests flexible matching by providing 
four matching degrees based on a predefined taxonomy tree. However, the work only 
considers capacity interpretations of concepts. Furthermore, the technique addresses 
mainly is-a relations between concepts. 

Li and Horrocks [4] described a service matchmaking prototype using a DAML-S 
ontology and a DL reasoner. This approach is useful to describe the semantics of 
service adverts and queries. However, each DL expression can be regarded as a con- 
junction of several restriction clauses. If a given advert and query contain different 
restriction clauses at some corresponding position, the algorithm cannot work well. 
To address this problem, they proposed to minimize the DL expression under
examination by moving unimportant providedBy and requestedBy clauses out of the profile 
that participates in the comparison. This adjustment only partially solves the problem 
because the remaining part of the profile may also include some information that will 
affect the matching result. Another limitation of the work is that the authoring of DL 
expressions requires users to be familiar with formal logic. 

Medjahed et al. [6] proposed an ontology-based framework for Web service com- 
position. Several composability rules are deployed to compare the syntactic and se- 
mantic features of Web services. A composability model that covers from low-level 
binding mode to high-level composition soundness was formulated. In the satisfiabil- 
ity evaluation, the framework only adopts simple set inclusion checking. Essentially, 
the approach is based on the capacity interpretations of concepts. 

Other related work includes the LARKS [12] and ATLAS [10] matchmaking systems. 
LARKS defines five techniques for service matchmaking. Those techniques mostly 
compare service text descriptions, signatures, and logical constraints about inputs and 
outputs [6]. ATLAS defines two methods to compare services for functional attributes 
and capabilities [6]. 

Generally, existing techniques decide matching results based on relations between 
compared concept pairs. In our approach, these relations are used as the intermediate 
result in the satisfiability evaluation. To accurately perform capability matching, we 
take into account the semantic interpretations of concepts involved. Our satisfiability 
evaluation differs from others in that it uniformly performs multiple satisfiability 



Semantic Interpretation and Matching of Web Services 545 



evaluations based on semantic interpretations of concepts. Therefore, the final match- 
ing result is a synthesis of multiple satisfiability evaluations. 



3 Semantic Interpretation and Capability Matching 

Let us first overview some preliminary concepts on conceptual modeling before pre- 
senting our satisfiability evaluation technique, which is based on semantic interpreta- 
tions of concepts. A capability matching algorithm will be subsequently formulated. 

3.1 Preliminary 

Definition 1 (Concept-Instance Structure): A concept-instance (CI) structure CIS = 
(C, I, cimap) is a triple where: 

• C is a set of concepts; 

• I is a set of instances; 

• Total function cimap: $C \to 2^I$ relates a concept in C with a non-empty set of
instances in I. 

Fig. 2 gives the graphical representation of an illustrative example. The following 
is the corresponding CI structure: 

CIS = (C, I, cimap) 

C = {Vehicle, Bus, Car, Sedan, SUV} 

I = $\{b_1, b_2, se_1, se_2, suv\}$ 
cimap(Vehicle) = $\{b_1, b_2, se_1, se_2, suv\}$ 
cimap(Bus) = $\{b_1, b_2\}$ 
cimap(Car) = $\{se_1, se_2, suv\}$ 
cimap(Sedan) = $\{se_1, se_2\}$ 
cimap(SUV) = $\{suv\}$ 

Definition 2 (Equivalent, subsumed, including, intersecting and disjoint): In a CI 
structure CIS, any two concepts $c_1$ and $c_2$ are subject to one of the following relations: 

• equivalent: if $cimap_{CIS}(c_1) = cimap_{CIS}(c_2)$; 

• subsumed: if $cimap_{CIS}(c_1) \subset cimap_{CIS}(c_2)$; 

• including: if $cimap_{CIS}(c_1) \supset cimap_{CIS}(c_2)$; 

• disjoint: if $cimap_{CIS}(c_1) \cap cimap_{CIS}(c_2) = \emptyset$; 

• intersecting: otherwise. 

Definition 3 (Descendent concept): If concepts $c_1$ and $c_2$ have an equivalent or
subsumed relation, $c_1$ is said to be a descendent concept of $c_2$. 

Since the equivalent relation is reflexive, a concept is a descendent concept of itself. 

Tables 1 and 2 give the concept relations and the descendent concepts, respectively, 
for the example of Fig. 2. 

Fig. 2. An illustrative example 

Table 1. The concept relation table ($c_1$ to $c_2$; rows are $c_1$, columns are $c_2$) 

            Vehicle      Bus          Car          Sedan        SUV
Vehicle     equivalent   including    including    including    including
Bus         subsumed     equivalent   disjoint     disjoint     disjoint
Car         subsumed     disjoint     equivalent   including    including
Sedan       subsumed     disjoint     subsumed     equivalent   disjoint
SUV         subsumed     disjoint     subsumed     disjoint     equivalent


Table 2. The descendent concept table 

For Vehicle     Vehicle, Bus, Car, Sedan, SUV
For Bus         Bus
For Car         Car, Sedan, SUV
For Sedan       Sedan
For SUV         SUV
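
Definitions 2 and 3 can be checked mechanically on the example of Fig. 2. The following is a minimal Python sketch, not part of the original paper: the dictionary `CIMAP` and the function names `relation` and `descendents` are illustrative assumptions, and instance names such as `b1` stand for $b_1$.

```python
# Hypothetical encoding of the CI structure of Fig. 2:
# each concept maps to its non-empty set of instances (function cimap).
CIMAP = {
    "Vehicle": {"b1", "b2", "se1", "se2", "suv"},
    "Bus": {"b1", "b2"},
    "Car": {"se1", "se2", "suv"},
    "Sedan": {"se1", "se2"},
    "SUV": {"suv"},
}


def relation(cimap, c1, c2):
    """Classify the relation from c1 to c2 as in Definition 2."""
    s1, s2 = cimap[c1], cimap[c2]
    if s1 == s2:
        return "equivalent"
    if s1 < s2:  # proper subset
        return "subsumed"
    if s1 > s2:  # proper superset
        return "including"
    if not (s1 & s2):
        return "disjoint"
    return "intersecting"


def descendents(cimap, c):
    """Concepts that are equivalent to or subsumed by c (Definition 3)."""
    return [d for d in cimap
            if relation(cimap, d, c) in ("equivalent", "subsumed")]


if __name__ == "__main__":
    # One row per c1, one entry per c2, in the order of the dictionary keys.
    for c1 in CIMAP:
        print(c1, [relation(CIMAP, c1, c2) for c2 in CIMAP])
    for c in CIMAP:
        print("For", c, "->", descendents(CIMAP, c))
```

Running the script prints one row of concept relations per concept, followed by the descendent concepts of each concept, which can be compared against Tables 1 and 2.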



Definition 4 (Coverage): Suppose that $c_1, \ldots, c_n$ are concepts in a CI structure CIS. If 
I is a set of instances that satisfies 

$I \subseteq cimap_{CIS}(c_1) \cup \ldots \cup cimap_{CIS}(c_n)$ and 
$(\exists e \in cimap_{CIS}(c_1), e \in I) \land \ldots \land (\exists e \in cimap_{CIS}(c_n), e \in I)$, 

I is said to have a coverage relation with $\{c_1, \ldots, c_n\}$, written as: $I\ R_{cov}\ \{c_1, \ldots, c_n\}$. 

If I has a coverage relation with $\{c_1, \ldots, c_n\}$, it means that I is related to each
concept in the set, i.e., I contains at least one instance of each of them. Some coverage
relations for the example in Fig. 2 are given below: 

$\{b_1\}\ R_{cov}\ \{Bus\}$ 
$\{b_2\}\ R_{cov}\ \{Bus\}$ 
$\{b_1, b_2\}\ R_{cov}\ \{Bus\}$ 
$\{se_1, suv\}\ R_{cov}\ \{Sedan, SUV\}$ 
$\{se_2, suv\}\ R_{cov}\ \{Sedan, SUV\}$ 
$\{se_1, se_2, suv\}\ R_{cov}\ \{Sedan, SUV\}$ 
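
The two conditions of Definition 4 translate directly into a set test. The sketch below is illustrative only: `CIMAP` is a hypothetical encoding of part of the Fig. 2 example and `covers` is an assumed helper name, not something defined in the paper.

```python
# Hypothetical encoding of part of the Fig. 2 example.
CIMAP = {
    "Bus": {"b1", "b2"},
    "Sedan": {"se1", "se2"},
    "SUV": {"suv"},
}


def covers(cimap, instances, concepts):
    """Return True iff `instances` has a coverage relation with `concepts`.

    Condition 1: every instance belongs to at least one listed concept.
    Condition 2: every listed concept contributes at least one instance.
    """
    instances = set(instances)
    union = set().union(*(cimap[c] for c in concepts))
    within_union = instances <= union
    touches_each = all(instances & cimap[c] for c in concepts)
    return within_union and touches_each


if __name__ == "__main__":
    print(covers(CIMAP, {"b1"}, ["Bus"]))
    print(covers(CIMAP, {"se1", "suv"}, ["Sedan", "SUV"]))
    print(covers(CIMAP, {"se1", "se2"}, ["Sedan", "SUV"]))
```

A set such as {se1, se2} fails the second condition for {Sedan, SUV} because it contains no instance of SUV, and a set containing b1 fails the first condition because b1 lies outside both concepts.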



3.2 Satisfiability Evaluation 
3.2.1 Semantic Interpretation 

Definition 5 (Restriction interpretation): If c is a concept in a CI structure CIS, the 
restriction interpretation $\vartheta^{R}_{CIS}(c)$ of c is the set of all sets that have a coverage relation 
with {c}: $\vartheta^{R}_{CIS}(c) = \{ I \mid I\ R_{cov}\ \{c\} \}$. 

Definition 6 (Capacity interpretation): If c is a concept in a CI structure CIS, the 
capacity interpretation $\vartheta^{C}_{CIS}(c)$ of c is the set of all sets that have a coverage relation 
with the set of all descendent concepts of c: $\vartheta^{C}_{CIS}(c) = \{ I \mid I\ R_{cov}\ \{c_1, \ldots, c_n\} \}$, where
$c_1, \ldots, c_n$ are all descendent concepts of c. 

The restriction interpretation of a concept represents the semantics of this concept, 
while the capacity interpretation of a concept represents the semantics of all
descendent concepts of this concept. Some restriction interpretations and capacity
interpretations for the example in Fig. 2 are given below: 

$\vartheta^{R}_{CIS}(Sedan) = \{\{se_1\}, \{se_2\}, \{se_1, se_2\}\}$ 
$\vartheta^{C}_{CIS}(Sedan) = \{\{se_1\}, \{se_2\}, \{se_1, se_2\}\}$ 
$\vartheta^{R}_{CIS}(SUV) = \{\{suv\}\}$ 
$\vartheta^{C}_{CIS}(SUV) = \{\{suv\}\}$ 
$\vartheta^{R}_{CIS}(Car) = \{\{se_1\}, \{se_2\}, \{suv\}, \{se_1, se_2\}, \{se_1, suv\}, \{se_2, suv\}, \{se_1, se_2, suv\}\}$ 
$\vartheta^{C}_{CIS}(Car) = \{\{se_1, suv\}, \{se_2, suv\}, \{se_1, se_2, suv\}\}$ 

Definition 7 (Semantic interpretation): An interpretation is a semantic
interpretation iff it is a restriction interpretation or a capacity interpretation. 

Semantic interpretation is used for the satisfiability evaluation in Web service
capability matching. Each concept has two semantic interpretations: restriction
interpretation and capacity interpretation. 
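
For a small CI structure, both interpretations can be enumerated by brute force. The following Python sketch assumes the Fig. 2 example encoded as a hypothetical dictionary `CIMAP`; the names `covers`, `restriction` and `capacity` are illustrative. It enumerates every non-empty subset of cimap(c), so it is exponential in the number of instances and only meant for toy examples.

```python
from itertools import combinations

# Hypothetical encoding of the CI structure of Fig. 2.
CIMAP = {
    "Vehicle": {"b1", "b2", "se1", "se2", "suv"},
    "Bus": {"b1", "b2"},
    "Car": {"se1", "se2", "suv"},
    "Sedan": {"se1", "se2"},
    "SUV": {"suv"},
}


def nonempty_subsets(items):
    """All non-empty subsets of `items`, as frozensets."""
    items = sorted(items)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def covers(cimap, instances, concepts):
    """Coverage relation of Definition 4."""
    union = set().union(*(cimap[c] for c in concepts))
    return instances <= union and all(instances & cimap[c] for c in concepts)


def descendents(cimap, c):
    """Concepts whose instance set is equal to or contained in that of c."""
    return [d for d in cimap if cimap[d] <= cimap[c]]


def restriction(cimap, c):
    """Restriction interpretation: sets covering {c} (Definition 5)."""
    return {s for s in nonempty_subsets(cimap[c]) if covers(cimap, s, [c])}


def capacity(cimap, c):
    """Capacity interpretation: sets covering all descendents of c (Definition 6)."""
    desc = descendents(cimap, c)
    return {s for s in nonempty_subsets(cimap[c]) if covers(cimap, s, desc)}


if __name__ == "__main__":
    for concept in ("Sedan", "SUV", "Car"):
        print(concept, "R:", sorted(map(sorted, restriction(CIMAP, concept))))
        print(concept, "C:", sorted(map(sorted, capacity(CIMAP, concept))))
```

Only subsets of cimap(c) need to be tried as candidates, because every descendent of c has its instances inside cimap(c).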

3.2.2 Semantics Satisfiability 

Definition 8 (Semantics satisfiability): The evaluation for semantics satisfiability 
compares the semantic interpretations of concepts. Given two concepts $c_1$ and $c_2$ in a 
CI structure CIS and their semantic interpretations $\vartheta_{CIS}(c_1)$ and $\vartheta_{CIS}(c_2)$, 

• If $\forall e_2 \in \vartheta_{CIS}(c_2), \exists e_1 \in \vartheta_{CIS}(c_1), e_1 \subseteq e_2$, then $\vartheta_{CIS}(c_1)$ is said to be satisfied by $\vartheta_{CIS}(c_2)$. 

• Otherwise, $\vartheta_{CIS}(c_1)$ is said not to be satisfied by $\vartheta_{CIS}(c_2)$. 

Definition 8 unifies the impact of two different semantic interpretations so that the 
satisfiability evaluation can be done uniformly under different contexts. Let us denote 
$\lhd$ as the “is satisfied by” relation. Some semantics satisfiability results for the
example in Fig. 2 are given below: 

ft c ^ s (Bus) < ( Vehicle ) (Sedan) < ft^ IS (Car) 

as (Vehicle) < ft* s (Bus) * (Car) <1 ft* (Sedan) 

ft c l (Sedan) < ft c * s (Sedan) ft^ffSUV) < ft* s (SUV) 

ft a S (Sedan) <1 tl CB (Car) ft* s (Car) <1 (Sedan) 
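
Definition 8 can be exercised on the Fig. 2 example with a brute-force check. The sketch below is an illustration under stated assumptions: `CIMAP` is a hypothetical encoding of the example, `interpretation` takes "R" or "C" to select the restriction or capacity interpretation, and `is_satisfied_by(theta1, theta2)` tests whether the first interpretation is satisfied by the second. The results it prints are computed by the sketch and are not a transcription of the examples listed above.

```python
from itertools import combinations

# Hypothetical encoding of the CI structure of Fig. 2.
CIMAP = {
    "Vehicle": {"b1", "b2", "se1", "se2", "suv"},
    "Bus": {"b1", "b2"},
    "Car": {"se1", "se2", "suv"},
    "Sedan": {"se1", "se2"},
    "SUV": {"suv"},
}


def nonempty_subsets(items):
    items = sorted(items)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def interpretation(cimap, concept, kind):
    """Semantic interpretation of `concept`; kind is "R" or "C"."""
    if kind == "R":
        targets = [concept]
    else:
        # Descendents: concepts whose instances lie inside those of `concept`.
        targets = [d for d in cimap if cimap[d] <= cimap[concept]]
    # Candidates are subsets of cimap(concept); each must hit every target.
    return {s for s in nonempty_subsets(cimap[concept])
            if all(s & cimap[t] for t in targets)}


def is_satisfied_by(theta1, theta2):
    """Definition 8: every e2 in theta2 contains some e1 from theta1."""
    return all(any(e1 <= e2 for e1 in theta1) for e2 in theta2)


if __name__ == "__main__":
    pairs = [("Bus", "C", "Vehicle", "C"), ("Vehicle", "C", "Bus", "C"),
             ("Vehicle", "R", "Bus", "R"), ("Bus", "R", "Vehicle", "R")]
    for c1, k1, c2, k2 in pairs:
        ok = is_satisfied_by(interpretation(CIMAP, c1, k1),
                             interpretation(CIMAP, c2, k2))
        print(f"{k1}({c1}) satisfied by {k2}({c2}):", ok)
```

Under capacity interpretations the more general concept satisfies the more specific one, while under restriction interpretations the direction is reversed.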

3.2.3 Satisfiability Result 

Given two concepts, the result of the satisfiability evaluation depends on two factors: 
(i) the relation between them and (ii) the semantic interpretations of them. The fol- 
lowing satisfiability result table can be derived from Definition 8: 



Table 3. The result of satisfiability evaluation of $\vartheta_{CIS}(c_1)$ and $\vartheta_{CIS}(c_2)$ 

(Rows give the interpretation of $c_1$ and columns the interpretation of $c_2$. The Figure row
of the original table, which depicts each relation as a diagram, is not reproduced here.) 

Relation ($c_1$ and $c_2$)    $c_1$ \ $c_2$    R      C
equivalent                R            √      √
                          C            *1     √
subsumed                  R            ×      √
                          C            ×      √
including                 R            √      √
                          C            ×      ×
intersecting              R            ×      *2
                          C            ×      ×
disjoint                  R            ×      ×
                          C            ×      ×

Notes: 

• R: restriction interpretation; C: capacity interpretation. 

• √: $\vartheta_{CIS}(c_1) \lhd \vartheta_{CIS}(c_2)$; ×: $\neg(\vartheta_{CIS}(c_1) \lhd \vartheta_{CIS}(c_2))$. 

• *1: $\vartheta^{C}_{CIS}(c_1) \lhd \vartheta^{R}_{CIS}(c_2)$ iff all descendent concepts of $c_1$ and $c_2$ are equivalent; *2: 
$\vartheta^{R}_{CIS}(c_1) \lhd \vartheta^{C}_{CIS}(c_2)$ iff $c_2$ has descendent concepts other than itself and each
descendent concept c satisfies $cimap_{CIS}(c) \cap cimap_{CIS}(c_1) \cap cimap_{CIS}(c_2) \neq \emptyset$. 



3.2.4 Satisfiability Theorems and Transitive Law 

To effectively use the satisfiability result table in Web service capability matching, 
we derive the following four satisfiability theorems and a transitive law: 
Theorem 1. The restriction interpretation / capacity interpretation of a concept can 
satisfy the restriction interpretation / capacity interpretation of any concept
equivalent to this concept. 

Theorem 2. The capacity interpretation of a concept can satisfy the restriction inter- 
pretation of any concept equivalent to this concept. 

Theorem 3. The capacity interpretation of a concept can satisfy the capacity inter- 
pretation of any descendent concept of itself. 

Theorem 4. The restriction interpretation of a concept can be satisfied by the
restriction interpretation of any descendent concept of itself. 

Theorem 5. Transitive law applies to semantics satisfiability. 

Theorems 1 and 2 concern the equivalent part of Table 3, while Theorems 3 and 4 
concern the subsumed part and the including part of the table, respectively. The proof 
of Theorem 1 follows naturally from the definitions of capacity and restriction 
interpretations. The proofs of the other four theorems can be found in [13]. 

3.2.5 Satisfiability Evaluation 

Suppose that $c_1, \ldots, c_n$ are concepts in a CI structure CIS, such that $c_n$ is a descendent 
concept of $c_{n-1}$, ..., and $c_2$ is a descendent concept of $c_1$. Note that $c_n$ has no descendent 
concept other than itself. We have two satisfiability sequences: 
$\vartheta^{R}_{CIS}(c_1) \lhd \ldots \lhd \vartheta^{R}_{CIS}(c_n)$, and $\vartheta^{C}_{CIS}(c_n) \lhd \ldots \lhd \vartheta^{C}_{CIS}(c_1)$, 
where $\vartheta^{R}_{CIS}(c_n) = \vartheta^{C}_{CIS}(c_n)$ (Fig. 3). 

The first sequence gives the satisfiability evaluation result between restriction
interpretations. A more specific concept can satisfy a more general concept under this 
semantic interpretation. This is what the concept subsumption reasoning approach
achieves. The second sequence gives the satisfiability evaluation result between capacity
interpretations. A more general concept can satisfy a more specific concept under this 
semantic interpretation. This is what the concept coverage comparing approach
achieves. 

Our satisfiability evaluation generalizes the above two approaches by supporting 
both restriction and capacity interpretations. In addition, it allows dynamic
specification of semantic interpretations, which is useful in practice. Consider the binding
protocol ontology in Fig. 4. A service provider that supports SOAP, HTTP and MIME
advertises its protocols as All Binding Protocols using a capacity interpretation. Users
looking for a service supporting all binding protocols can make a request for All Binding
Protocols using a capacity interpretation. However, if users are happy with any binding
protocol, they can select a restriction interpretation instead. Dynamic specification of
semantic interpretations enables service providers and users to flexibly express their
capabilities and requirements. 

Fig. 4. A binding protocol ontology 
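
The binding protocol scenario can be replayed with a small brute-force check. This is a minimal sketch under explicit assumptions: the ontology of Fig. 4 is approximated by a hypothetical `CIMAP` in which `AllBindingProtocols` has the three sub-concepts SOAP, HTTP and MIME with one instance each, and the helper names are illustrative. In `is_satisfied_by(request, advertisement)` the first argument is the interpretation chosen by the user and the second the one chosen by the provider.

```python
from itertools import combinations

# Hypothetical CI structure approximating the ontology of Fig. 4.
CIMAP = {
    "AllBindingProtocols": {"soap", "http", "mime"},
    "SOAP": {"soap"},
    "HTTP": {"http"},
    "MIME": {"mime"},
}


def nonempty_subsets(items):
    items = sorted(items)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def interpretation(cimap, concept, kind):
    """kind "R": restriction interpretation; kind "C": capacity interpretation."""
    if kind == "R":
        targets = [concept]
    else:
        targets = [d for d in cimap if cimap[d] <= cimap[concept]]
    return {s for s in nonempty_subsets(cimap[concept])
            if all(s & cimap[t] for t in targets)}


def is_satisfied_by(request, advertisement):
    """Definition 8 applied to a request and an advertisement."""
    return all(any(e1 <= e2 for e1 in request) for e2 in advertisement)


if __name__ == "__main__":
    adv_all = interpretation(CIMAP, "AllBindingProtocols", "C")
    adv_soap = interpretation(CIMAP, "SOAP", "C")
    want_all = interpretation(CIMAP, "AllBindingProtocols", "C")
    want_any = interpretation(CIMAP, "AllBindingProtocols", "R")
    print("all protocols vs. full provider:", is_satisfied_by(want_all, adv_all))
    print("any protocol vs. SOAP-only provider:", is_satisfied_by(want_any, adv_soap))
    print("all protocols vs. SOAP-only provider:", is_satisfied_by(want_all, adv_soap))
```

In this toy setting, a user asking for any binding protocol (restriction interpretation) is served by a SOAP-only provider, whereas a user asking for all binding protocols (capacity interpretation) is only served by the provider that advertises all of them.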



3.3 CRI Models 



Fig. 5 defines an ontology describing the semantics of service advertisements and 
requests using the Web Ontology Language (OWL). This ontology is similar to the 
latest OWL-S 1.0 [17] (OWL-S is an OWL-based Web service ontology). 




Fig. 5. The ontology for describing Web service advertisements and requests 

The ontology presents a meta-model of Web service descriptions. Important con- 
cepts that participate in the satisfiability evaluation are specified using capacity inter- 
pretation or restriction interpretation. Each important concept is also assigned a spe- 
cific satisfiability evaluation process. The specification of semantic interpretations 
and the assignment of satisfiability evaluation processes are described by means of 
CRI models. 

Definition 9 (CRI model): A Capacity-Restriction Interpretation (CRI) model, CRIM 
= (O, imap, emap), is a capability matching oriented structure, where: 

• O is a Web service ontology (e.g., the one defined in Fig. 5) for describing the 
semantics of service advertisements and requests. 

• imap: $C \to \{capacity, restriction\}$ is a partial function that relates a concept with 
a semantic interpretation (capacity interpretation or restriction interpretation). C 
is the set of concepts defined in O. 

• emap: $C \to \{equivalence, plug\text{-}in, subsumption\}$ is a partial function that relates 
a concept with a satisfiability evaluation process (equivalence evaluation, plug-in 
evaluation, or subsumption evaluation). C is the set of concepts defined in O. 

Different satisfiability evaluation processes have different impacts on capability 
matching. Subsumption evaluation requires that a concept from the request should 
subsume or be equivalent to the corresponding concept from the advertisement (like the 
concept subsumption reasoning approach), while plug-in evaluation has exactly the
opposite criterion (like the concept coverage comparing approach). Finally, equivalence 
evaluation asks for an equivalent relation between the two concepts involved. 

CRI models are used to describe Web service semantics, focusing in particular on the 
specification of semantic interpretations and the assignment of satisfiability
evaluation processes. To the best of our knowledge, this is the first model to 
explicitly separate the assignment of satisfiability evaluation processes from the
semantics representation of Web services, so that the satisfiability evaluation can be 
changed or improved independently of the service semantics representation. Many
studies implicitly assume that all concepts have capacity interpretations, and therefore 
the satisfiability evaluation is fixed to the comparison of concept coverage. Some 
studies adopting DLs lack a flexible assignment mechanism for different
satisfiability evaluation processes, only allowing concept subsumption reasoning. CRI
models improve the matching accuracy by allowing dynamic specification of semantic 
interpretations and flexible assignment of satisfiability evaluation processes. 

In a CRI model, the ontology, semantic interpretations and satisfiability evaluation 
processes are dynamically bound. The most appropriate assignment of 
satisfiability evaluation processes depends mostly on the semantic interpretations of 
the concepts involved, but the concept meanings and the related matching requirements 
also have an impact. At present, our research suggests the following CRI model: 

• O: The ontology defined in Fig. 5. 

• imap: (1) ServiceCategory.taxonomy.Thing → capacity; (2) Input.parameterType.Thing
→ restriction; (3) Output.parameterType.Thing → restriction. 

• emap: (1) ServiceCategory.taxonomy.Thing → plug-in; (2) Input.role.Thing → 
equivalence; (3) Input.parameterType.Thing → subsumption; (4) Output.role.Thing 
→ equivalence; (5) Output.parameterType.Thing → subsumption. 

Note that the bold literals are concepts under consideration. We use a prefix before 
each Thing concept to specify which Thing we are referring to (Fig. 5). Explanations 
for this suggested CRI model are given below. 
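
The suggested CRI model can be written down as two lookup tables plus a rule that says which concept relations each evaluation process accepts. The following Python sketch is an illustration only: the dotted key spelling (for example `ServiceCategory.taxonomy.Thing`), the names `IMAP`, `EMAP`, `ACCEPTED_RELATIONS` and `evaluate`, and the instance sets used in the demo are all assumptions. The accepted relations follow the description of the three evaluation processes given after Definition 9, with the relation taken from the request concept to the advertisement concept.

```python
# Semantic interpretation assigned to each concept (function imap).
IMAP = {
    "ServiceCategory.taxonomy.Thing": "capacity",
    "Input.parameterType.Thing": "restriction",
    "Output.parameterType.Thing": "restriction",
}

# Satisfiability evaluation process assigned to each concept (function emap).
EMAP = {
    "ServiceCategory.taxonomy.Thing": "plug-in",
    "Input.role.Thing": "equivalence",
    "Input.parameterType.Thing": "subsumption",
    "Output.role.Thing": "equivalence",
    "Output.parameterType.Thing": "subsumption",
}

# Relations (request concept to advertisement concept) accepted by each process.
ACCEPTED_RELATIONS = {
    "equivalence": {"equivalent"},
    "plug-in": {"equivalent", "subsumed"},
    "subsumption": {"equivalent", "including"},
}


def relation(req_instances, adv_instances):
    """Relation from the request concept to the advertisement concept."""
    req, adv = set(req_instances), set(adv_instances)
    if req == adv:
        return "equivalent"
    if req < adv:
        return "subsumed"
    if req > adv:
        return "including"
    return "intersecting" if req & adv else "disjoint"


def evaluate(concept_path, req_instances, adv_instances):
    """Apply the evaluation process that emap assigns to `concept_path`."""
    process = EMAP[concept_path]
    return relation(req_instances, adv_instances) in ACCEPTED_RELATIONS[process]


if __name__ == "__main__":
    # Hypothetical instance sets: renting_car is subsumed by renting_vehicle.
    print(evaluate("ServiceCategory.taxonomy.Thing", {"car"}, {"car", "bus"}))
    print(evaluate("ServiceCategory.taxonomy.Thing", {"car", "bus"}, {"car"}))
```

Keeping `EMAP` separate from `IMAP` mirrors the separation that the CRI model makes between the assignment of evaluation processes and the semantics representation, so either table can be changed without touching the other.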

The term taxonomy represents service category. A more general taxonomy means a 
more powerful service (e.g. renting_vehicle can satisfy renting_car). We specify 
taxonomy with a capacity interpretation. parameterType represents range restriction 
for an input/output parameter (e.g. we may need an integer input, or we plan to
produce a float output). We specify both parameterTypes to have restriction
interpretations. 

The assignment of satisfiability evaluation processes is closely related to the
semantic interpretations specified by function imap. Since taxonomy has a 
capacity interpretation, we examine Table 3 with both $c_1$ and $c_2$ subject to capacity 
interpretations. We find that $c_1$ is satisfied by $c_2$ only when they have equivalent or 
subsumed relations. As such, concept taxonomy matches successfully only when the 
taxonomy from a service request can be plugged into or is equivalent to the corresponding
taxonomy from a service advertisement. The same conclusion follows more directly from
Theorem 1 and Theorem 3. Therefore, we assign the plug-in evaluation process 
to taxonomy. Similarly, we assign the subsumption evaluation process to parameterType. 
This time we check Table 3 based on restriction interpretations of $c_1$ and $c_2$.
Alternatively, we can use Theorem 1 and Theorem 4 to infer it directly. 

We have two more mappings about role. A role represents the business role of In- 
puts or Outputs of a Web service process. In the satisfiability evaluation, two roles 
from an advertisement and a request respectively should be similar or equal in mean- 
ing. We assign equivalence evaluation process to them. 

3.4 Capability Matching Algorithm 

The capability matching algorithm examines each advertisement in the Web service 
repository to find whether there is one fully or partially satisfying the given request. 
For each advertisement and request pair, the algorithm performs the satisfiability
evaluation based on the ServiceProfile and ServiceModel semantics (Fig. 5). For the
ServiceProfile part, the algorithm performs plug-in evaluation for taxonomy as our CRI 
model suggests. The ServiceModel part is more complex. 

A ServiceModel can have several Processes, and each Process may have multiple 
Inputs/ Outputs. For full capability matching, the algorithm checks whether for each 
Process declared in the request, the advertisement can offer a satisfying Process. 
However, for partial capability matching, the criterion is reduced to whether there is 
at least one Process declared in the request such that the advertisement can offer a 
satisfying Process. Partial capability matching serves for service compositions. 

No matter whether using full or partial capability matching, the algorithm checks 
all Inputs and Outputs in the Process pair being examined. A match is recognized 
when the following two conditions are met: (i) for each Input in the Process from the 
advertisement, there is a matched Input in the Process from the request (it guarantees 
that the request can provide sufficient Inputs ); (ii) for each Output in the Process from 
the request, there is a matched Output in the Process from the advertisement (it guar- 
antees that the advertisement can provide sufficient Outputs). 

For the Input/Output part, the algorithm performs equivalence evaluation for role 
and subsumption evaluation for parameterType, following our CRI model. Due to 
space limitation, only the pseudo code for full capability matching algorithm is given 
below. 

List capabilityMatching(req) {
    List result = new List();
    forall adv in advertisement_repository do {
        if checkProfile(getProfile(req), getProfile(adv)) == true
           && checkModel(getModel(req), getModel(adv)) == true
        then result.add(adv);
    }
    return result;
}

boolean checkProfile(reqProfile, advProfile) { ... }

boolean checkModel(reqModel, advModel) {
    forall reqProcess in getAllProcesses(reqModel) do {
        if not exists (advProcess) in getAllProcesses(advModel)
           such that checkProcess(reqProcess, advProcess) == true
        then return false;
    }
    return true;
}

boolean checkProcess(reqProcess, advProcess) {
    forall advInput in getAllInputs(advProcess) do {
        if not exists (reqInput) in getAllInputs(reqProcess)
           such that checkInput(advInput, reqInput) == true
        then return false;
    }
    forall reqOutput in getAllOutputs(reqProcess) do { ... }
    return true;
}

boolean checkInput(advInput, reqInput) {
    if equivalent(getRole(advInput), getRole(reqInput)) == true
       && subsuming(getParamType(advInput),
                    getParamType(reqInput)) == true
    then return true else return false;
}

boolean checkOutput(reqOutput, advOutput) { ... }
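
The pseudo code for full capability matching can be turned into a small runnable Python sketch. Everything below is an illustration under stated assumptions rather than the authors' implementation: the data classes, the toy concept map `CIMAP` and the helper names are hypothetical, concepts are compared through their instance sets, and the bodies that the pseudo code elides (`checkProfile`, `checkOutput` and the output loop) are filled in from the surrounding prose, namely plug-in evaluation for taxonomy and matching condition (ii) for outputs.

```python
from dataclasses import dataclass, field

# Hypothetical concept-to-instance map used only for this sketch.
CIMAP = {
    "renting_vehicle": {"car_rental", "bus_rental"},
    "renting_car": {"car_rental"},
    "Number": {"int", "float"},
    "Integer": {"int"},
    "Budget": {"budget"},
    "Price": {"price"},
}


@dataclass
class Param:
    role: str
    param_type: str


@dataclass
class Process:
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


@dataclass
class Service:
    taxonomy: str
    processes: list = field(default_factory=list)


def equivalent(c1, c2):
    return CIMAP[c1] == CIMAP[c2]


def subsuming(required, offered):
    # The required concept subsumes or is equivalent to the offered one.
    return CIMAP[required] >= CIMAP[offered]


def check_param(required, offered):
    # Equivalence evaluation for role, subsumption evaluation for parameterType.
    return (equivalent(required.role, offered.role)
            and subsuming(required.param_type, offered.param_type))


def check_profile(req, adv):
    # Assumed body: plug-in evaluation, request taxonomy within advertised one.
    return CIMAP[req.taxonomy] <= CIMAP[adv.taxonomy]


def check_process(req_proc, adv_proc):
    # (i) every advertised Input is matched by some Input of the request.
    inputs_ok = all(any(check_param(a, r) for r in req_proc.inputs)
                    for a in adv_proc.inputs)
    # (ii) every requested Output is matched by some advertised Output (assumed).
    outputs_ok = all(any(check_param(r, a) for a in adv_proc.outputs)
                     for r in req_proc.outputs)
    return inputs_ok and outputs_ok


def check_model(req, adv):
    # Full matching: each requested Process needs a satisfying advertised Process.
    return all(any(check_process(rp, ap) for ap in adv.processes)
               for rp in req.processes)


def capability_matching(req, repository):
    return [adv for adv in repository
            if check_profile(req, adv) and check_model(req, adv)]
```

For partial capability matching, the outer `all` in `check_model` would be replaced by `any`, following the criterion described in the text.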



4 Implementation 

Our Web service capability matching system adopts a three-tier architecture as illus- 
trated in Fig. 6. System Service Layer provides four functionalities: advertisement 
browsing, advertisement/request consistency checking, request matching, and service 
composition. Ontology Matching Layer realizes ontology consistency checking and 
ontology structure matching for Web services based on xlinkit technology. Satisfiabil- 
ity Evaluation Layer is responsible for the satisfiability evaluation. 




Fig. 6. The architecture of our Web service capability matching system 

Detailed implementation considerations including the adaptation of xlinkit to ser- 
vice matching purposes are explained in [13]. The applications of xlinkit in the soft- 
ware engineering field can be found in [7] [8] [9]. 



5 Evaluation 

Paolucci et al. proposed four criteria in [11] for evaluating a service capability matching 
system: (i) the system should support semantics matching between advertisements 
and requests based on ontology; (ii) the system should minimize false positives and 
false negatives; (iii) the system should encourage service providers and users to be 
honest in their descriptions, since dishonesty carries the price of either not being 
matched or being matched inappropriately; and (iv) the matching process should be 
efficient. 

Our matching system strives to satisfy all four criteria. Firstly, the matching is 
based on an OWL ontology. The system can perform semantic inferences leading to 
the recognition of capability matches despite syntactic differences. Secondly, 
the use of OWL also supports accuracy: no match is recognized when the
satisfiability criteria cannot be met for a given advertisement and request pair. Thirdly, 
dishonest behavior (e.g. arbitrarily exaggerating capabilities) by service providers 
and users will lead to no matching or inappropriate matching, which will harm them 
eventually. Finally, in order to increase performance, our system adopts the efficient 
xlinkit tool for performing ontology structure matching between advertisements and 
requests, and uses a special concept relation reasoner for the kernel satisfiability 
evaluation. 

In addition, our matching system has two more features: (i) the use of the notions 
of capacity interpretation and restriction interpretation to adapt to different contexts, 
which improves the accuracy of semantics representation; and (ii) the use of CRI 
models to support flexible assignment of satisfiability evaluation processes, which
improves the accuracy of satisfiability evaluation. 



6 Conclusions and Future Work 

In this paper, we studied several Web service capability matching models, and classi- 
fied their satisfiability evaluation techniques into three approaches. We analyzed their 
limitations under two semantic interpretations: restriction interpretation and capacity 
interpretation. To address these limitations, we proposed a unified Web service capa- 
bility matching model, in which the satisfiability evaluation can be performed in a 
uniform way based on semantic interpretations of concepts. On the basis of the satis- 
fiability result table and satisfiability theorems, we have proposed CRI models as a 
promising solution to the Web service capability matching problem. The correspond- 
ing algorithm can be implemented based on xlinkit technology. 

At present, our implementation is subject to a communication bottleneck between 
the Ontology Matching Layer and the Satisfiability Evaluation Layer. We are working 
on various means to alleviate this problem so that our prototype can be refined 
into a practical semantic Web service tool. More features (e.g., satisfiability
evaluation between capacity interpretation and restriction interpretation) will be
incorporated in the refined CRI model. We also plan to extend the proposed semantic
evaluation to support mobile and pervasive services [3][14]. 
Abstract. Identity management has arisen as a major and urgent challenge for 
internet-based communications and information services. Internet services in- 
volve complex networks of relationships among users and providers - human 
and automated - acting in many different capacities under interconnected and 
dynamic contexts. There is a pressing need for frameworks and models to sup- 
port the analysis and design of complex social relationships and identities in or- 
der to ensure the effective use of existing protection technologies and control 
mechanisms. Systematic methods are needed to guide the design, operation, 
administration, and maintenance of internet services, in order to address com- 
plex issues of security, privacy, trust and risk, as well as interactions in func- 
tionality. All of these rely on sophisticated concepts for identity and techniques 
for identity management. 

We propose using the requirements modeling framework GRL to facilitate identity
management for Internet services. Using this modeling approach, we are
able to represent different types of identities, social dependencies between iden- 
tity users and owners, service users and providers, and third party mediators. 
We may also analyze the strategic rationales of business players/stakeholders in 
the context of identity management. This modeling approach will help identity 
management technology vendors to provide customizable solutions, user organizations
to form integrated identity management solutions, system operators
and administrators to accommodate changes, and policy auditors to enforce information
protection principles, e.g., Fair Information Practice Principles.



1 Introduction 

Networks, and businesses, are about relationships. As a result of the convergence 
brought about by IP technologies, numerous Internet services are now available on a 
common network. Internet services thus involve complex networks of relationships 
among users and providers - human and automated - acting in many different capaci- 
ties under interconnected and dynamic contexts. A typical user may employ many 
services, concurrently or at different times. Many services are complementary and 
may invoke each other to achieve functionalities, while others could have adverse 
interactions. Thus, identity management has arisen as a major and urgent challenge 
for internet-based communications and information services. 

Identity management concepts and technologies have been developed in the com- 
puter security area in a traditional internal enterprise context. They are currently being 
extended to deal with the much more complex, open IP-network environment. While
advances in fundamental technologies and mechanisms are crucial, there is a pressing 
need for frameworks and models to support the analysis and design of complex social 
relationships and identities in order to ensure the effective use of these technologies 
and mechanisms. Systematic methods are needed to guide the design, operation, ad- 
ministration, and maintenance of internet services, in order to address complex issues 
of security, privacy, trust and risk, as well as interactions in functionality. All of these 
rely on sophisticated concepts for agent identity and techniques for identity manage- 
ment. Thus, effective identity management is crucial for the successful transition to 
IP-based integrated communication and information services. 

Based on our previous work on trust analysis [14] and on security and privacy requirements
modeling [13, 6], we propose using a modeling framework to facilitate identity
management for Internet services. By using this modeling approach, we are able
to represent identities of different natures and to describe social dependencies among
identity users and owners, service users and providers, and third-party mediators. We may also
analyze the strategic rationales of these business players/stakeholders in the setting of
identity management. With the support of this modeling approach, vendors who
are involved in the development and sale of identity management, privacy and security
products and services will be able to provide customizable solutions to different users.
Organizations that want to deploy identity management in their business settings will
be able to form an integrated solution that addresses all competing high-level concerns
at the same time. System operators and administrators will be able to accommodate
changes in the system and its environment and maintain security levels at run
time. Policy makers and auditors who want to enforce certain principles, e.g., Fair
Information Practice Principles, will be able to evaluate whether a proposed identity
management solution complies with the intended principles by using various model
analysis techniques.

In section 2, we introduce identity modeling structures based on GRL, illustrating 
the basic concepts and how they can help understand and analyze the basic settings of 
identity management. Section 3 provides a systematic approach for identity design, in 
which a general design model is given. Section 4 concludes the work and discusses 
related work and future directions. 



2 Towards an Intentional Identity Management 
Modeling Framework 

Identity management is a difficult process that can be facilitated by the use of technology.
However, it is not just a matter of developing a piece of technology that people
can use to manage their identities. There is already a wide range of products and
mechanisms that can help - passwords, smart cards, personal organizers, to name but
a few. What is urgently needed is a framework into which these products can
fit, and within which new products and services can be developed where required.

In this section and the next, we use GRL modeling constructs to build a generic identity
management meta-model, with which the requirements for identity management,
the environment within which it must exist, and the architectural designs that have 
been proposed so far can be analyzed at an intentional level. 

Although conventional entity-relationship modeling can provide the foundation
for identity management and access control, it does not provide constructs to describe
the intentions of ID owners and users. Thus, it is very hard to define and reason about
the context and the motivation. As a consequence, the design of identity management
and access control solutions is hard to relate to the specific concerns and
preferences of stakeholders, as well as to the operational constraints of a specific situation.

The Goal-oriented Requirements Language (GRL) [7] was originally developed to
describe users’ needs at a higher level of abstraction. It is an elaborated and evolving
version of the strategic actor modeling framework i* [12]. By using concepts such
as actor, goal, softgoal, task, dependency, and resource, users’ requirements and specific
existing technology solutions can be explicitly related. Alternative solutions can
be traded off based on their impacts on the requirements, according to user-set priorities.

2.1 Dealing with Identities in GRL: Actor, Agent, Role, and Position 

The right-hand side of Figure 1 shows the basic modeling constructs of GRL in terms
of identity. Actor is a general concept that covers all kinds of intentional entities that
are identifiable. However, not all identities are created equal. There are in fact at least
three different types of identities, each having different attributes and each
likely to experience different adoption characteristics and market longevity.




These three kinds of IDs are defined as follows:

> Agent (Personal) Identity: is both timeless and unconditional. Agent identities are the true
personal digital identities and are owned and controlled entirely by the individual,
for his/her sole benefit. Agent identities exist for people as well as for devices and
programs, with the exception that a device or program operates in Agent mode
only, meaning that it reflects the identity and intention of the person who is
controlling the device.

> Position (Organizational) Identity: is both conditional and temporary in its issuance
to the individual. We typically describe these identities as being assigned or issued,
and they typically refer to the person in the context of a business relationship.
For example, nearly every 'identity' we have with a business is a position
identity: our job title, our cell phone, our air miles card, and our social insurance are
all Position IDs. These IDs comprise the bulk of digital identities today.




558 Lin Liu and Eric Yu



> Role (Attribute) Identity: is abstract. This type of identity speaks to the way in
which companies aggregate customers into different buckets for the purposes of
advertising or communicating. For example, an agent is either a 'frequent buyer' or
a 'one time customer', etc. Role IDs are typically based upon the agents’ demographics
or behavior in their interactions with the business. The entire CRM market
caters to such identities. Role IDs usually have specific attributes or preferences attached.

Also shown on the diagram are the different types of actor association links.
INS is used when an actor is an instance of another actor (class). ISA
is used if one actor (class) is a subclass of another actor (class). Plays is
used to connect an agent with the abstract roles it is playing. Occupies is used to
relate an agent to his/her official occupations, represented by positions. Covers is
used when one position is associated with multiple roles. With these links, we are
able to create a network structure for the players involved in a system.
This forms the basis of our identity management modeling framework.

In traditional identity management, the concept of role is most widely understood 
in the context of role-based access control (RBAC) [10]. Access privileges on a sys- 
tem are grouped into roles, and users are attached to roles as a convenient mechanism 
to manage their privileges. 
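
For reference, the RBAC idea summarized above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the role names, user names, privilege strings, and helper functions are hypothetical and are not taken from [10].

```python
# Minimal RBAC sketch: privileges are grouped into roles, and users are
# attached to roles. All names below are hypothetical.
ROLE_PRIVILEGES = {
    "employee": {"read:intranet"},
    "travel_admin": {"read:intranet", "book:discount_ticket"},
}

USER_ROLES = {
    "john": {"employee"},
    "mary": {"employee", "travel_admin"},
}


def privileges_of(user):
    """Return the union of the privileges of all roles attached to the user."""
    granted = set()
    for role in USER_ROLES.get(user, set()):
        granted |= ROLE_PRIVILEGES.get(role, set())
    return granted


def has_privilege(user, privilege):
    """A user holds a privilege only through one of its roles."""
    return privilege in privileges_of(user)
```

Note that nothing in such a structure records why a user holds a role or what the user intends to do with it, which is the gap the intentional modeling below is meant to address.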

In contrast, we are more interested in reasoning about a system entity’s intentions 
and capabilities. In managing identity across multiple systems, users may play multi- 
ple roles, each involving different intentions and implying different capabilities. Be- 
yond addressing issues of user classification, role definition and conflict resolution, 
we are able to explore system vulnerabilities and opportunities in the identity man- 
agement context. 

The left-hand side of Figure 1 shows a typical set of related user identities. It allows
for many different types of systems and relationships, depending upon the context.
This type of model explores the implications and relationships of the multiple
IDs used by an individual (John). From this example, we know that John has needs as
a physical entity (Agent), and that he obtains different IDs/names for his different
organizational positions in order to satisfy the corresponding needs. His name as an Employee
is “John_Public”, his Citizen ID is “Johanson_Qwerty_Publico”, and his Customer ID is
“Johnny_Public”. As a Friend, he has many nicknames, such as JoJo, Qwert, and Q-ball
[8]. Sometimes, one ID is needed when applying for another; e.g., to become an employee,
he has to be a legitimate citizen first. In the meantime, he is identified by the Lottery
Service as a One Time Customer, and by the Travel Service as a Frequent Customer. Linking those
identities would have to be done based on certain predetermined principles: always ask
for the consent of the individual, link through a trusted authority, link within a circle of
trust, etc.
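
The actor types and association links of Section 2.1 can be approximated as a small labeled graph. The following Python sketch is an assumption-laden illustration rather than a GRL tool: the class names, the `prerequisites` table, and the helper methods are hypothetical, and only the link names and John's identities come from the text.

```python
from dataclasses import dataclass, field

# Association link types named in the text; this encoding is an assumption.
LINK_TYPES = {"INS", "ISA", "Plays", "Occupies", "Covers"}


@dataclass(frozen=True)
class Actor:
    name: str
    kind: str  # "agent", "position", or "role"


@dataclass
class IdentityModel:
    links: set = field(default_factory=set)           # (source, link type, target)
    names: dict = field(default_factory=dict)         # (agent, position) -> ID/name
    prerequisites: dict = field(default_factory=dict)  # position -> required positions

    def link(self, source, link_type, target):
        if link_type not in LINK_TYPES:
            raise ValueError(f"unknown link type: {link_type}")
        self.links.add((source, link_type, target))

    def targets(self, source, link_type):
        return {t for (s, kind, t) in self.links if s == source and kind == link_type}

    def roles_of(self, agent):
        """Roles played directly, plus roles covered by occupied positions."""
        roles = set(self.targets(agent, "Plays"))
        for position in self.targets(agent, "Occupies"):
            roles |= self.targets(position, "Covers")
        return roles

    def may_occupy(self, agent, position):
        """An ID can be applied for only if all prerequisite positions are held."""
        held = self.targets(agent, "Occupies")
        return self.prerequisites.get(position, set()).issubset(held)


john = Actor("John", "agent")
citizen = Actor("Citizen", "position")
employee = Actor("Employee", "position")
customer = Actor("Customer", "position")
frequent = Actor("Frequent Customer", "role")
one_time = Actor("One Time Customer", "role")

# To become an employee, John has to be a legitimate citizen first.
model = IdentityModel(prerequisites={employee: {citizen}})
model.link(john, "Occupies", citizen)
model.link(john, "Occupies", customer)
model.names[(john, citizen)] = "Johanson_Qwerty_Publico"
model.names[(john, customer)] = "Johnny_Public"
model.link(john, "Plays", frequent)  # as identified by the Travel Service
model.link(john, "Plays", one_time)  # as identified by the Lottery Service
```

Keeping the links as plain triples makes it straightforward to ask which identities would become connected if two IDs were linked, which is where principles such as individual consent would be checked.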

2.2 Proceed with a General Social Dependency Model
of Identity Management - Why and How Do We Manage Identity?

In the Internet service setting, a fundamental relationship between interactive parties 
is the user-provider relationship. In the social dependency model in Figure 2, the user 
of an Internet service and the provider of the service are represented as two position 
actors: User[Service] and Provider[Service]. The two actors depend on each other to 
fulfill their own goals. First of all, the user depends on the provider for a service, 
which is represented as a goal dependency: Be Provided[Service]. In the other direction,
the provider depends on the user’s legitimacy as a user of the service, which is shown
as a softgoal dependency, Legitimacy[User[Service]]. The user may also prefer that the
service be personalized, which is modeled as another softgoal dependency,
Be Personalized[Service]. These two softgoal dependency relationships are the major
motivating requirements for identity management.

Fig. 2. A Generic Social Dependency (SD) model



In GRL, dependencies are distinguished into four different types. A goal dependency
represents an abstract functional requirement of one actor imposed on another
actor. For example, Be Provided[Service] is a functional goal dependency. A softgoal
dependency represents an abstract non-functional requirement. Be Personalized[Service]
is modeled as a non-functional dependency since it does not affect the system’s normal
functioning, but it has subtle, indirect effects on the business. The other
dependency relationships are the task dependency, through which the depending party
delegates a course of action to the depended party, and the resource dependency, in which
the depending party asks for the delivery of a certain information entity or business
asset from the depended party. A resource is data contained in an information system,
a service provided by a system, a system capability such as processing
power or communication bandwidth, an item of system equipment, or a facility that
houses system operations and equipment.
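
The four dependency types can be written down as a small data structure. The sketch below is illustrative only; the field names `depender`, `dependee`, and `dependum` are our own labels for the depending party, the depended party, and the thing depended upon, and the `depends_on` helper is hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class DependencyType(Enum):
    GOAL = "goal"          # abstract functional requirement
    SOFTGOAL = "softgoal"  # abstract non-functional requirement
    TASK = "task"          # a delegated course of action
    RESOURCE = "resource"  # delivery of an information entity or business asset


@dataclass(frozen=True)
class Dependency:
    depender: str  # the depending party
    dependee: str  # the depended party
    dependum: str  # what is depended upon
    kind: DependencyType


# The three dependencies between user and provider described for Figure 2.
SD_MODEL = [
    Dependency("User[Service]", "Provider[Service]",
               "Be Provided[Service]", DependencyType.GOAL),
    Dependency("Provider[Service]", "User[Service]",
               "Legitimacy[User[Service]]", DependencyType.SOFTGOAL),
    Dependency("User[Service]", "Provider[Service]",
               "Be Personalized[Service]", DependencyType.SOFTGOAL),
]


def depends_on(model, depender, kind=None):
    """List the dependencies of one actor, optionally filtered by type."""
    return [d for d in model
            if d.depender == depender and (kind is None or d.kind == kind)]
```

Filtering by `DependencyType.SOFTGOAL` returns the two softgoal dependencies that the text identifies as the motivating requirements for identity management.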

Figure 2 illustrates some of the features of generic identity management architectures
[8, 9]. Here each User is a Principal, who can have multiple Credentials, which can
be transferred or presented to establish a claimed principal identity. The Principal uses a
Credential to sign on to the Credentials Collector, which Verifies the Credential. The
Authentication Authority performs authentication of a principal at a particular time using a
particular method of authentication, e.g., password, token, smart card, biometrics, etc. The
Attribute Authority verifies Attributes about a principal, based upon to-be-determined inputs.
Examples of attributes are: group, role, title, contract code, etc. The Policy Decision Point
makes a decision as a result of evaluating the user’s identity, the requested operation,
and the requested resource in light of applicable security policies. The Policy Enforcement
Point enforces the decision made by the access decision function upon each access control
request. In this single service setting, all of the roles other than Principal are played by
the service provider, so the dependencies between the roles are likely to be dependable.
However, this is not always the case. In the multiple service case discussed
in Section 2.3, the dependability of roles needs to be evaluated.
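
The division of labor among these roles can be illustrated with a minimal Python sketch for the single service setting. Everything concrete here is an assumption made for illustration: the in-memory stores, the function names, and the policy format are hypothetical, and a real system would not keep credentials in plain text.

```python
# Hypothetical stores held by the single service provider.
CREDENTIALS = {"john": "s3cret"}                     # principal -> credential
ATTRIBUTES = {"john": {"role": "employee"}}          # principal -> attributes
POLICY = {("employee", "read", "timesheet")}         # (role, operation, resource)


def authenticate(principal, credential):
    """Credentials Collector and Authentication Authority: verify the credential."""
    return principal in CREDENTIALS and CREDENTIALS[principal] == credential


def attributes_of(principal):
    """Attribute Authority: return the verified attributes of a principal."""
    return ATTRIBUTES.get(principal, {})


def decide(principal, operation, resource):
    """Policy Decision Point: evaluate identity, operation, and resource."""
    role = attributes_of(principal).get("role")
    return (role, operation, resource) in POLICY


def enforce(principal, credential, operation, resource):
    """Policy Enforcement Point: apply the decision to each access request."""
    if not authenticate(principal, credential):
        return "denied: authentication failed"
    if not decide(principal, operation, resource):
        return "denied: policy"
    return "granted"
```

Because all four functions live in one module, the calls between them are trivially dependable, which mirrors the observation that a single provider plays every role other than Principal.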



2.3 Continue with a Social Dependency Model on Single Sign-On 

Using the generic model developed above, we now consider a more complex situa- 
tion. One of the most demanded features in identity management is Single Sign-On. 
In this kind of setting, a user authenticates with one Web service provider and is then
able to use a secured resource from another service provider without further authentication.

Using GRL, we are able to explicitly describe the social dependency relationships
between different system entities, so that configurations of the future system
can be explored and traded off. Figure 3 is a representation of one possible mode of
supporting Single Sign-On. In this case, the travel service provider (travel.com) will
provide discount tickets only to employees of company.com. Thus, each time it receives a
service request, it pulls the authentication information from the Employer (played by
company.com) based on the IDs provided by the User (John). In this scenario, company.com
acts as Credentials Collector, Authentication Authority, and Attribute Authority. John is the
Principal. Travel.com plays Policy Decision Point and Policy Enforcement Point. Compared
with the single service situation, there is distributed trust between travel.com and
company.com. In other words, travel.com has to trust company.com to authenticate the
customer.

Fig. 3. Single Sign-On: A Social Dependency model
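
The pull-based Single Sign-On scenario can be sketched as two cooperating objects. This is a simplified illustration under our own assumptions, not a description of any real SSO protocol: the class names, method names, and the shape of the assertion dictionary are hypothetical.

```python
class Employer:
    """company.com: Credentials Collector, Authentication and Attribute Authority."""

    def __init__(self, employees):
        self._employees = dict(employees)  # user id -> credential
        self._signed_on = set()

    def sign_on(self, user_id, credential):
        ok = user_id in self._employees and self._employees[user_id] == credential
        if ok:
            self._signed_on.add(user_id)
        return ok

    def assertion_for(self, user_id):
        """Answer a pull request from a relying service."""
        return {
            "subject": user_id,
            "authenticated": user_id in self._signed_on,
            "employee": user_id in self._employees,
        }


class TravelService:
    """travel.com: Policy Decision Point and Policy Enforcement Point."""

    def __init__(self, trusted_employer):
        # Distributed trust: travel.com relies on company.com's assertions.
        self._employer = trusted_employer

    def request_discount_ticket(self, user_id):
        assertion = self._employer.assertion_for(user_id)
        if assertion["authenticated"] and assertion["employee"]:
            return "discount ticket issued"
        return "request refused"


company = Employer({"john": "s3cret"})
travel = TravelService(company)
```

The user never presents a credential to `TravelService`; the only thing it can check is what `Employer` asserts, which is exactly the trust dependency the model makes explicit.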

The key distinction of such a model from conventional ones is that, from the distribution
of roles, one can infer the distribution of trust and dependency relationships. A
special case is when multiple services are provided by different branches of the same
company which trust each other, e.g., Bell residential phone service, Bell wireless
phone service, Bell Sympatico high-speed internet service, and Bell ExpressVu satellite
TV can share customer information with each other since they are operating
within the same trusted jurisdiction.

A variation on the above Single Sign-On model is that a third party security service 
provider (e.g., Microsoft .Net Passport, Novell DigitalMe) provides authentication 
assertions for the user. In this case, Microsoft as a centralized intermediary has many 
competitive advantages in terms of customer management (or control). At the same 
time, it also creates a vulnerable point of security, risk of privacy infringement, and 
issues of scalability due to its centralized nature. 

2.4 Reasoning Further upon the Generic Model - 
Revealing Potential Conflict of Interests 

Identity management as a business model and business enabler is fairly new and unproven.
On the one hand, it may bring great opportunities to organizations that own
information, and to the customers and partners who want to share such information. This
also highlights the importance of adhering to emerging privacy standards and data
regulations. Identity is at the core of online business and data transactions. It is
interwoven into most business processes, including granting access to information and
systems, enabling customer relationship management, and driving relationships
with business partners and suppliers. Besides just presenting credentials for authenticated
access to systems and services, an identity includes attributes that make for
more targeted and productive use of the services. On the other hand, unwanted scenarios
such as customer lock-in and privacy violations may also happen. We want to analyze
these different viewpoints and conflicts of interest as well.




Fig. 4. A SD model revealing potential conflict-of-interest 



Figure 4 shows that company.com has obtained the Personal Info of John through its
capacity as a Provider[Service1]. Being a Marketer[Service2] at the same time, company.com
may put this information to secondary use and recommend new services to
its current customer. This is a beneficial identity management scenario from the service
provider’s viewpoint. However, it may be frustrating or threatening to customers
who do not want to be the target of marketing activity. We represent such side
effects with a dotted link with a textual label (e.g., Dislike). The focal point is highlighted.
A focal point can be either a link or an element where the inconsistencies
between different viewpoints, or conflicts of interest between different agents, come to
the surface.
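
One simple way to surface such focal points mechanically is to record the role under which each piece of information was obtained and to flag any use under a different role. The Python sketch below is our own illustrative assumption, not the analysis procedure of the paper; the `Holding` and `Use` records and the `focal_points` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Holding:
    holder: str       # organization holding the information
    resource: str     # e.g. "Personal Info"
    owner: str        # the individual the information is about
    obtained_as: str  # role under which the information was obtained


@dataclass(frozen=True)
class Use:
    holder: str
    resource: str
    used_as: str      # role under which the information is used


def focal_points(holdings, uses):
    """Return (use, holding) pairs where information is put to secondary use,
    i.e. used under a role other than the one under which it was obtained."""
    flagged = []
    for use in uses:
        for held in holdings:
            same_item = (held.holder, held.resource) == (use.holder, use.resource)
            if same_item and held.obtained_as != use.used_as:
                flagged.append((use, held))
    return flagged


holdings = [Holding("company.com", "Personal Info", "John", "Provider[Service1]")]
uses = [
    Use("company.com", "Personal Info", "Provider[Service1]"),
    Use("company.com", "Personal Info", "Marketer[Service2]"),
]
flagged = focal_points(holdings, uses)
```

In this example only the use as Marketer[Service2] is flagged; whether the flagged use is acceptable still depends on the owner's attitude (e.g., Dislike), which this sketch does not model.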

3 Design for Identity and Access Management - 
Towards a Systematic Approach 

In designing an identity management system, some basic questions need to be answered:

• Who is involved? i.e., what are the types of users the organization will manage? 

• What are the objectives of the stakeholders? i.e., what are the high-level concerns 
of the system players and system designers, particularly, the non-functional ones? 

• What are the resources/services to be accessed and controlled? 

• What kinds of ID are needed? 

• What kinds of ID management procedure or mechanism are needed? 

• What are the implications of each of these specific ID management solutions to- 
wards the high-level concerns? 

3.1 What Does the Stakeholder Have in Mind? - Strategic Rationales Modeling 

By building a social dependency model, we are able to analyze the inter-actor rela- 
tionships relevant to identity management. However, to better understand each stake- 
holder’s requirements, a strategic rationale (SR) model such as the one shown in Fig- 
ure 5 provides the desired details. 




Fig. 5. A Strategic Rationale model - Stakeholders’ intentions 



In a strategic rationale model, an actor’s high-level concerns are elaborated and refined
into lower-level goals, operable tasks or concrete information entities. For example,
an agent John who plays the service user (User[Service]) has the following requirements:
to access the service (Access[Service]), and that the service be of good quality
(Quality[Service]). Furthermore, quality of service can be elaborated into concerns about
Security, Respect to Privacy, and Ease-of-Use, which are further refined into Availability,
Confidentiality, and Be Personalized. On the functional side, Access[Service] is
operationalized into the tasks to be performed by the user: Provide Personal Info and
Present[ID[Service]]. In order for the service to be personalized, John needs to Give
Preferences for the desired Service.

Similar analysis happens on the provider side. This refinement process goes on un- 
til sufficient information is obtained. Thus, the previously isolated external dependen- 
cies in the SD model can be interconnected through the hierarchical goal refinement 
structures within the actor boundaries of the SR model. 

3.2 Competing Non-functional Requirements and Alternative Solutions 

Having understood the concerns of the agents involved, we may now begin to look for
design solutions. In Figure 6, we combine the requirements of the service user and the
provider. Knowledge from a system designer’s viewpoint is also added - for example,
knowledge about the various building blocks of an identity management system for
Internet services. In the figure we see that there are competing non-functional
requirements: Security, User Productivity, Privacy, Cost Reduction, Accountability,
Ease-of-use, and so on.

The major modeling constructs used include goals, softgoals and tasks. They are
linked together with means-ends links, decomposition links, and contribution
links. In the design modeling setting, means-ends links are used to indicate
alternative techniques, while decompositions are used to identify necessary components.
Contributions are used to represent the impacts on a softgoal. An impact can be
positive or negative; partial, sufficient, or unknown.

In Figure 5, building blocks of identity management systems are modeled. This
kind of knowledge is usually obtained from domain experts [3]. To generate a satisficing
design, the impacts of the chosen solution on these non-functional requirements,
represented as softgoals, need to be evaluated and traded off. Here we use a qualitative
approach to represent and propagate the impacts across the graph. Details about
the evaluation algorithm are in [2]. Then, by combining the favorable alternatives from
each design decision-making process, we obtain a design blueprint.
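
To make the idea of qualitative evaluation concrete, the following Python sketch propagates contribution labels from a chosen alternative to the softgoals it affects. It is a deliberately simplified stand-in and not the evaluation algorithm of [2]: the label names, the combination rule, the alternative "Single Sign-On" and all contribution values are assumptions chosen for illustration.

```python
# Contribution kinds: sufficient/partial, positive/negative, or unknown.
STRENGTH = {"make": 2, "help": 1, "unknown": 0, "hurt": -1, "break": -2}
LABEL = {2: "satisfied", 1: "weakly satisfied", 0: "undetermined",
         -1: "weakly denied", -2: "denied"}


def evaluate(alternative, contributions):
    """Label each softgoal touched by the chosen alternative.

    contributions is a list of (alternative, softgoal, kind) triples.
    Mixed positive and negative evidence is reported as a conflict,
    to be resolved by the analyst.
    """
    by_goal = {}
    for alt, softgoal, kind in contributions:
        if alt == alternative:
            by_goal.setdefault(softgoal, []).append(STRENGTH[kind])
    result = {}
    for softgoal, values in by_goal.items():
        if min(values) < 0 < max(values):
            result[softgoal] = "conflict"
        elif max(values) > 0:
            result[softgoal] = LABEL[max(values)]
        else:
            result[softgoal] = LABEL[min(values)]
    return result


# Hypothetical contributions; the softgoal names are those listed in the text.
CONTRIBUTIONS = [
    ("Single Sign-On", "Ease-of-use", "make"),
    ("Single Sign-On", "User Productivity", "help"),
    ("Single Sign-On", "Security", "hurt"),
    ("Single Sign-On", "Security", "help"),
    ("Single Sign-On", "Privacy", "unknown"),
    ("Password per service", "Ease-of-use", "hurt"),
]
```

Comparing the result of `evaluate` for each alternative gives the designer a qualitative basis for trading off the competing softgoals.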

4 Discussion and Conclusion 

We have outlined an approach for modeling and analyzing identity management is- 
sues. The approach is based on an intentional and social ontology that is centered on 
strategic actor relationships. This ontology allows us to go beyond entity relationships 
and mechanistic behavior, to deal with the opportunistic behavior of strategic actors. 
Interdependencies among actors place constraints on their freedom of action. Never- 
theless, constraints can be violated due to agent autonomy (unlike in mechanistic 
systems) as in the conflict-of-interest example. Strategic actors seek to achieve goals 
(hard and soft) by obtaining new identities from service providers, taking into account 
the opportunities and vulnerabilities arising from various dependency relationships, as 
illustrated in the generic identity modeling example. 

Our approach is complementary to existing frameworks and techniques for identity
management. Recent approaches have emphasized taking a systematic and holistic 
view towards identity and security management processes [11] and the business value 
that identity management can bring [1, 3]. There is also increasing use of new tech- 
nology and business models, e.g., single sign-on, third-party intermediaries in authen- 
tication or trust [1 1. Our approach emphasizes the systematic analysis of relationships 
among strategic actors and their intentions by extending conceptual modeling tech- 
niques. It supports the exploration and management of structural alternatives, based 
on a balanced consideration of all competing requirements, thus complementing the 
various point solutions of recent identity management techniques. 

Identity management is increasingly connected with other activities in enterprise 
management. The strategic modeling approach provides a way of linking identity 
related analysis to business strategy analysis and technology analysis. An intentional 
conceptual modeling approach can thus provide a unifying framework for enterprise 
information systems, supporting decision making and the management of change 
across technical system development, business development, and identity and security 
management [16]. 

In information systems and software engineering research, organizational modeling
has been of interest, often in connection with requirements engineering. Goal-oriented 
approaches have been used in this context, and agents or actors are often part of the 
modeling ontology [4, 5]. However, the GRL approach is distinctive in its treatment
of agents/actors as being strategic [12], and thus readily adaptable to the identity 
management analysis domain illustrated in this paper. A related technique was used 
earlier to model intellectual property in a business analysis setting [14]. 

While this paper has outlined some basic modeling concepts, much remains to be 
done. There is much potential in the synergy between strategic modeling and the 
foundational principles in conceptual modeling. For example, in analyzing the implications
of an identity, one would like to model the inter-relatedness among their subject
matters. The interactions between intentional concepts and relationships (e.g.,
strategic actors, intentional dependencies) and non-intentional ones (e.g., processes,
information assets, time, etc.) need to be detailed. Libraries of reusable design knowledge
about identity management would be very helpful during modeling and analysis.
These are topics of ongoing research. 

Abstract. In this paper, we develop a novel Web Usage Manipulation
Language (WUML) which is a declarative language for manipulating 
Web log data. We assume that a set of trails formed by users during 
the navigation process can be identified from Web log files. The trails 
are dually modelled as a transition graph and a navigation matrix with 
respect to the underlying Web topology. A WUML expression is executed 
by transforming it into Navigation Log Algebra (NLA), which consists of 
the sum, union, difference, intersection, projection, selection, power and 
grouping operators. As real navigation matrices are sparse, we perform 
a range of experiments to study the impact of using different matrix 
storage schemes on the performance of the NLA. 



1 Introduction 

The topology of a Web site is constructed according to the designers’ conceptual 
view of the Web information. There may be a mismatch, however, between the 
users’ behavior and the expectation of the designers. Therefore, we propose and 
develop a query language on Web log data. We assume that the Web usage 
information can be generated from log files through a cleaning process [11,4]. 
Based on our earlier work in [11], we model a collection of user sessions on 
a given Web topology as a weighted directed graph, called a transition graph. 
A corresponding navigation matrix is then computed using knowledge of the 
underlying Web topology. 

Herein, we extend the four basic operators of sum, union, intersection and 
difference from [11] on valid navigation matrices to a more comprehensive set 
of operators: sum, union, intersection, difference, projection, selection, 
power and grouping. We call these operators collectively the Navigation Log 
Algebra (the NLA). Navigation matrices generated from real Web log files are 
sparse. We therefore carry out a spectrum of experiments to study the performance
of the NLA operators on synthetically generated navigation matrices.
The results indicate that the storage schemes affect the operators’ performance differently.

* This work is supported in part by grants from the Research Grant Council of Hong 
Kong, Grant Nos DAG01/02.EG05 and HKUST6185/02E. 

To gain better insight regarding navigation behavior from the log data, we
further develop a novel declarative language termed Web Usage Manipulation
Language (WUML), which allows users to specify queries on navigation matrices.
The WUML expressions are implemented as a sequence of NLA operations.
Using WUML, a Web designer is able to better understand navigation details 
over the site structure based on the analysis of querying results. For example, 
the overall usage of the site can be generated by a WUML query that combines 
all user categories, while the deviation analysis can be generated by a WUML 
query that finds out the contrast between the designer’s expectation and user 
navigation behaviors. WUML also enables the designers to view the overall per- 
formance of a set of closely-related pages using grouping. 

Our Contributions. (1) We define the NLA on navigation matrices, which is
a comprehensive set of operators: sum, union, intersection, difference,
projection, selection, power and grouping. (2) We propose a validation algorithm
called VALID, which is able to preserve the validity of the navigation
matrix when executing some of the operations. Essentially, this prevents isolated
sets of pages from arising in NLA operations. (3) We develop a novel declarative
language WUML on navigation matrices. WUML can be transformed into
a corresponding sequence of NLA operations. (4) We study three different storage
schemes to deal with the sparse nature of navigation matrices. By carrying out
a spectrum of experiments on synthetic Web log data, we clarify the effects of
storage schemes on individual NLA operators.

Related Work. There has been a lot of research related to applying data mining
techniques to Web log data. Many Web usage mining tools [1, 2, 5,
14, 13, 7] have been developed to help designers improve Web sites, attract
visitors, or provide users with a personalized and adaptive service. Several mining
languages, such as WUM’s MINT [13] and WEBMINER’s query language [7],
have also been proposed for these objectives. However, these languages are based on
mining techniques for association rules and sequential patterns. Our WUML is
developed to specify queries that are sufficiently expressive for log data
represented as a transition graph, or equivalently a navigation matrix.

The rest of this paper is organized as follows. In Sect. 2, we give preliminary 
definitions related to Web usage analysis. NLA for the navigation matrices is 
discussed in Sect. 3. We propose WUML and discuss the transformation of a 
WUML expression into an NLA expression in Sect. 4. In Sect. 5, three storage 
schemes for navigation matrices are introduced. A set of experimental results on 
the NLA using different storage schemes are analyzed in Sect. 6. Finally, we give 
our concluding remarks in Sect. 7. 

2 Preliminaries 

A Web topology W is defined as a directed graph, in which each node represents 
one Web page and each directed link represents the hyperlink between pages. A 
user session is a sequence of page requests from the same user such that no two 
consecutive requests are separated by more than X minutes. In a user session, 
two consecutive pages should have a link in the Web topology.

A transition graph is a weighted directed graph constructed from W by
adding two special pages: the starting page S and the finishing page F. Given a 
set of user sessions, we define the weight of the link from S to any other page as 
the number of times that the page is first requested. Similarly, the weight of the 
link from any page to F is the number of times that the page is last requested. As 
for links between any other two pages in W, called internal links, the weight is 
the number of times that two pages appear as consecutive pages in user sessions. 
We dually model a transition graph as a navigation matrix [11]. A navigation 
matrix is defined as the adjacency matrix of a transition graph. A one-to-one 
correspondence exists between transition graphs and navigation matrices. 
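
The construction above can be sketched in Python. This is a minimal
illustration, not the authors' implementation: the function name
build_navigation_matrix, the representation of a session as a list of page
indices 1..n, and the convention that index 0 stands for S and index n+1 for F
are assumptions made for the example.

```python
def build_navigation_matrix(sessions, n):
    """Build an (n+2) x (n+2) navigation matrix from user sessions.

    Assumed layout: index 0 is the starting page S, indices 1..n are
    Web pages, and index n+1 is the finishing page F.
    """
    size = n + 2
    matrix = [[0] * size for _ in range(size)]
    for session in sessions:
        if not session:
            continue
        matrix[0][session[0]] += 1          # S -> first requested page
        for src, dst in zip(session, session[1:]):
            matrix[src][dst] += 1           # consecutive pages in the session
        matrix[session[-1]][n + 1] += 1     # last requested page -> F
    return matrix


# Three hypothetical sessions over a site with three pages.
sessions = [[1, 2, 3], [1, 3], [2, 3]]
M = build_navigation_matrix(sessions, n=3)
```

Each session contributes exactly one unit of weight entering and leaving every
page it visits, which is why a matrix built this way has balanced pages.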

Now, we discuss the notion of validity of a transition graph. A node in a
transition graph is said to be balanced if the total weight of its in-links equals
the total weight of its out-links. The degree of a node is the total weight
of its in-links and out-links. A transition graph is said to be valid if it
satisfies the following four conditions: (1) the in-degree of S, the out-degree
of F and the weight of the link from S to F are zero; (2) every internal link
with non-zero weight is also a link in the Web topology (note that this excludes
self-loops in the graph); (3) every node except S and F is balanced; (4) every
node with non-zero degree is reachable from S.
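
The four conditions can be checked mechanically. The following Python sketch is
illustrative only: the name is_valid, the index layout (0 for S, n+1 for F) and
the representation of the Web topology as a set of (i, j) pairs are assumptions
of this example, and weights are assumed to be non-negative integers.

```python
def is_valid(matrix, topology):
    """Check the four validity conditions on an (n+2) x (n+2) navigation matrix.

    Assumed layout: index 0 is S, indices 1..n are pages, index n+1 is F.
    `topology` is a set of (i, j) pairs, one per hyperlink in the Web topology.
    """
    size = len(matrix)
    n = size - 2
    s, f = 0, n + 1

    # (1) Nothing enters S, nothing leaves F, and there is no S -> F link.
    if any(matrix[i][s] for i in range(size)):
        return False
    if any(matrix[f][j] for j in range(size)):
        return False
    if matrix[s][f]:
        return False

    # (2) Every weighted internal link exists in the topology (no self-loops).
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if matrix[i][j] and (i == j or (i, j) not in topology):
                return False

    # (3) Every page is balanced: total in-weight equals total out-weight.
    for p in range(1, n + 1):
        in_weight = sum(matrix[k][p] for k in range(size))
        out_weight = sum(matrix[p])
        if in_weight != out_weight:
            return False

    # (4) Every node with non-zero degree is reachable from S.
    reachable, stack = {s}, [s]
    while stack:
        i = stack.pop()
        for j in range(size):
            if matrix[i][j] and j not in reachable:
                reachable.add(j)
                stack.append(j)
    for p in range(size):
        degree = sum(matrix[p]) + sum(row[p] for row in matrix)
        if degree and p not in reachable:
            return False
    return True


good = [[0, 2, 1, 0, 0],
        [0, 0, 1, 1, 0],
        [0, 0, 0, 2, 0],
        [0, 0, 0, 0, 3],
        [0, 0, 0, 0, 0]]
good_ok = is_valid(good, {(1, 2), (1, 3), (2, 3)})

# Pages 1 and 2 form a balanced cycle that S never reaches.
isolated = [[0, 0, 0, 0],
            [0, 0, 1, 0],
            [0, 1, 0, 0],
            [0, 0, 0, 0]]
isolated_ok = is_valid(isolated, {(1, 2), (2, 1)})
```

The second matrix shows why condition (4) is needed: every page is balanced,
yet the cycle is an isolated set of pages.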

The validity of the navigation matrix is equivalent to the validity of the
transition graph. That is, a navigation matrix is said to be valid if and only if its
corresponding transition graph is valid. As we will see in the subsequent sections,
after executing some NLA operators, such as difference, intersection and
power, the output navigation matrix may not be valid: there may be nodes
with non-zero degree which cannot be reached from S. However, the validity of
the outcome is essential, since it ensures that further operations can be applied
to the result in a procedural manner. It is thus necessary to guarantee that the
output navigation matrix is valid.

We outline an algorithm, VALID, shown in Algorithm 1, which employs a
depth-first search (DFS) strategy to maintain the validity of a navigation matrix.
The input of the algorithm is a navigation matrix whose nodes are balanced.
VALID finds the connected components of the corresponding transition graph.
If there is more than one component, then for each component not containing S
we add a link from S to the root of that component and a link from that root
to F, which keeps the root balanced. The output matrix is then valid, since all
pages with non-zero degree are reachable from S.

Since VALID needs three extra arrays (namely, color, parent and tag),
each of size $n+2$, its space complexity is $O(n)$. The time-determining steps
are the two nested loops over $n+2$ entries in Lines 1 and 3 of Algorithm 1,
which take $O(n^2)$ time. The execution of the DFS-VISIT procedure takes $O(E)$
time, where $E$ is the number of edges in the transition graph corresponding to
the input navigation matrix. In the worst case, $E$ equals $(n+2)^2$. To conclude,
the time complexity of running VALID is $O(n^2)$.
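
A possible Python rendering of VALID is sketched below. It follows the idea of
Algorithm 1 but is not a line-by-line transcription: it uses an explicit stack
instead of the recursive DFS-VISIT, records the DFS roots in a list instead of a
parent array, and returns a repaired copy rather than modifying the input. The
function name valid and the index layout (0 for S, n+1 for F) are assumptions.

```python
def valid(matrix):
    """Return a copy of a balanced navigation matrix in which every
    component is linked to S and F (a sketch of the VALID idea)."""
    size = len(matrix)
    n = size - 2
    result = [row[:] for row in matrix]

    # tag[i] is True for nodes with outgoing weight; S is always tagged
    # so that the first search starts from S.
    tag = [any(row) for row in result]
    tag[0] = True

    visited = [False] * size
    roots = []
    for start in range(size):
        if tag[start] and not visited[start]:
            roots.append(start)            # start of a new component
            visited[start] = True
            stack = [start]
            while stack:
                i = stack.pop()
                for j in range(size):
                    if result[i][j] != 0 and not visited[j]:
                        visited[j] = True
                        stack.append(j)

    # Every root other than S heads a component that S cannot reach:
    # add S -> root and root -> F, which keeps the root balanced.
    for root in roots:
        if 1 <= root <= n:
            result[0][root] += 1
            result[root][n + 1] += 1
    return result


# S -> 3 -> F is reachable, but the cycle between pages 1 and 2 is isolated.
m = [[0, 0, 0, 1, 0],
     [0, 0, 2, 0, 0],
     [0, 2, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [0, 0, 0, 0, 0]]
fixed = valid(m)
```

Scanning a full row for every visited node gives the quadratic running time
discussed above.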

3 Navigation Log Algebra 

In this section, we define the Navigation Log Algebra (NLA). For more
illustrative examples of some of the operations, readers may refer to [11].

Algorithm 1 VALID Algorithm
Input: Matrix M with n + 2 dimensions

VALID(M)
1.  FOR all 0 <= i <= n + 1                  // Initialization
2.    DO color[i] := WHITE; parent[i] := NIL; tag[i] := 0;
3.       FOR all 0 <= j <= n + 1
4.         DO IF M[i][j] != 0
5.           THEN tag[i] := 1; break;        // tag indicates pages with non-zero degree
6.  tag[0] := 1;                             // Make sure DFS starts from page S
7.  FOR all 0 <= i <= n + 1                  // DFS: find connected components
8.    DO IF tag[i] = 1 and color[i] = WHITE
9.      THEN DFS-VISIT(i);
10. FOR all 1 <= i <= n
11.   DO IF tag[i] = 1 and parent[i] = NIL   // Ensure validity by adding
12.     THEN M[0][i] += 1; M[i][n + 1] += 1; // links from S and links to F

DFS-VISIT(i)                                 // Recursively search to form connected components
1.  color[i] := GRAY
2.  FOR all 0 <= j <= n + 1
3.    DO IF M[i][j] != 0 and color[j] = WHITE
4.      THEN parent[j] := i; DFS-VISIT(j);

Output: The valid matrix of M



Sum. The sum of two navigation matrices $M_1$ and $M_2$, denoted as $M_1 + M_2$,
is defined as a navigation matrix $M_3$ such that, for all $i, j \in \{0, \ldots, n+1\}$,
$(a_{ij})_3 = (a_{ij})_1 + (a_{ij})_2$. In other words, sum is exactly the ordinary
sum of two matrices. Trivially, the outcome $M_3$ remains valid.

Union. The union of two navigation matrices $M_1$ and $M_2$, denoted as $M_1 \cup M_2$,
is defined as a navigation matrix $M_3$ such that, for all $i, j \in \{1, \ldots, n\}$,

$(a_{ij})_3 = \max((a_{ij})_1, (a_{ij})_2)$;

$(a_{0j})_3 = \max((a_{0j})_1, (a_{0j})_2) + \max(0, \sum_{k=1}^{n+1} \max((a_{jk})_1, (a_{jk})_2) - \sum_{k=0}^{n} \max((a_{kj})_1, (a_{kj})_2))$;

$(a_{i(n+1)})_3 = \max((a_{i(n+1)})_1, (a_{i(n+1)})_2) + \max(0, \sum_{k=0}^{n} \max((a_{ki})_1, (a_{ki})_2) - \sum_{k=1}^{n+1} \max((a_{ik})_1, (a_{ik})_2))$;

and all other elements are zero. For union, we do not need to use VALID because
the max function used in union maintains the reachability of nodes from S.
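
As an illustration, sum and union can be sketched in Python as follows. The
function names nla_sum and nla_union and the list-of-lists representation are
assumptions of this example. The union sketch further assumes that the outgoing
weight of a page is summed over k = 1..n+1 and its incoming weight over
k = 0..n, which is how the definition is read here.

```python
def nla_sum(m1, m2):
    """Element-wise sum of two navigation matrices of equal size."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def nla_union(m1, m2):
    """Element-wise max on internal links, then rebalance each page by
    adding the missing weight to its S link or its F link."""
    size = len(m1)
    n = size - 2
    mx = [[max(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]
    result = [[0] * size for _ in range(size)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            result[i][j] = mx[i][j]
    for p in range(1, n + 1):
        out_weight = sum(mx[p][k] for k in range(1, n + 2))  # k = 1..n+1
        in_weight = sum(mx[k][p] for k in range(0, n + 1))   # k = 0..n
        result[0][p] = mx[0][p] + max(0, out_weight - in_weight)
        result[p][n + 1] = mx[p][n + 1] + max(0, in_weight - out_weight)
    return result


# m1: two sessions (1, 2); m2: one session (1, 2) and one session (2).
m1 = [[0, 2, 0, 0], [0, 0, 2, 0], [0, 0, 0, 2], [0, 0, 0, 0]]
m2 = [[0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 2], [0, 0, 0, 0]]
total = nla_sum(m1, m2)
merged = nla_union(m1, m2)
```

In this example the element-wise max alone would leave page 2 with incoming
weight 3 and outgoing weight 2; the correction term raises its link to F to 3.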

Difference. The difference of two navigation matrices $M_1$ and $M_2$, denoted
as $M_1 - M_2$, is defined as a navigation matrix $M_3$ such that, for all
$i, j \in \{1, \ldots, n\}$,

$(a_{ij})_3 = \max(0, (a_{ij})_1 - (a_{ij})_2)$;

$(a_{0j})_3 = \max(0, (a_{0j})_1 - (a_{0j})_2) + \max(0, \sum_{k=1}^{n+1} \max(0, (a_{jk})_1 - (a_{jk})_2) - \sum_{k=0}^{n} \max(0, (a_{kj})_1 - (a_{kj})_2))$;

$(a_{i(n+1)})_3 = \max(0, (a_{i(n+1)})_1 - (a_{i(n+1)})_2) + \max(0, \sum_{k=0}^{n} \max(0, (a_{ki})_1 - (a_{ki})_2) - \sum_{k=1}^{n+1} \max(0, (a_{ik})_1 - (a_{ik})_2))$;

and all other elements are zero. As the result may be invalid, we run VALID
after executing the above operation.

Intersection. The intersection of two navigation matrices $M_1$ and $M_2$,
denoted as $M_1 \cap M_2$, is defined as a navigation matrix $M_3$ such that, for all
$i, j \in \{1, \ldots, n\}$,

$(a_{ij})_3 = \min((a_{ij})_1, (a_{ij})_2)$;

$(a_{0j})_3 = \min((a_{0j})_1, (a_{0j})_2) + \max(0, \sum_{k=1}^{n+1} \min((a_{jk})_1, (a_{jk})_2) - \sum_{k=0}^{n} \min((a_{kj})_1, (a_{kj})_2))$;

$(a_{i(n+1)})_3 = \min((a_{i(n+1)})_1, (a_{i(n+1)})_2) + \max(0, \sum_{k=0}^{n} \min((a_{ki})_1, (a_{ki})_2) - \sum_{k=1}^{n+1} \min((a_{ik})_1, (a_{ik})_2))$;

and all other elements are zero. We run the VALID algorithm on the
intermediate result of intersection.

Projection. The projection of one navigation matrix $M_1$, denoted as $\Pi_P(M_1)$
where $P$ is a set of Web pages, is defined as a navigation matrix $M_3$ such
that, for all $P_i, P_j \in P$,

$(a_{ij})_3 = (a_{ij})_1$;

$(a_{0j})_3 = (a_{0j})_1 + \sum_{k=1,\ldots,n;\ P_k \notin P} (a_{kj})_1$;

$(a_{i(n+1)})_3 = (a_{i(n+1)})_1 + \sum_{k=1,\ldots,n;\ P_k \notin P} (a_{ik})_1$;

and all other elements are zero. Projection does not need the VALID algorithm.
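
A minimal Python sketch of projection follows. The name nla_project and the
index layout (0 for S, n+1 for F) are assumptions, and the sketch reads the
summation condition as ranging over pages outside P: weight that entered a kept
page from a dropped page is redirected to come from S, and weight that left a
kept page towards a dropped page is redirected to F.

```python
def nla_project(matrix, pages):
    """Keep only the pages in `pages`; links to or from dropped pages
    are folded into the S and F links of the kept pages."""
    size = len(matrix)
    n = size - 2
    keep = set(pages)
    outside = [k for k in range(1, n + 1) if k not in keep]
    result = [[0] * size for _ in range(size)]
    for i in keep:
        for j in keep:
            result[i][j] = matrix[i][j]
    for p in keep:
        result[0][p] = matrix[0][p] + sum(matrix[k][p] for k in outside)
        result[p][n + 1] = matrix[p][n + 1] + sum(matrix[p][k] for k in outside)
    return result


# Navigation matrix of the hypothetical sessions (1,2,3), (1,3) and (2,3).
M = [[0, 2, 1, 0, 0],
     [0, 0, 1, 1, 0],
     [0, 0, 0, 2, 0],
     [0, 0, 0, 0, 3],
     [0, 0, 0, 0, 0]]
projected = nla_project(M, [2, 3])
```

Because every removed link is replaced by a link from S or to F of the same
weight, the kept pages stay balanced and reachable, which is consistent with
projection not needing VALID.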

Selection. The selection of one navigation matrix $M_1$, denoted as $\sigma_{\theta x}(M_1)$,
where $\theta$ is a comparator such as >, <, >=, <=, or =, and $x$ is a positive
integer, is defined as a navigation matrix $M_3$ such that, for all $i, j \in \{1, \ldots, n\}$,

$(a_{ij})_3 = (a_{ij})_1$ if $(a_{ij})_1 \, \theta \, x$;

$(a_{0j})_3 = (a_{0j})_1 + \sum_{k=1,\ldots,n;\ (a_{kj})_1 \bar{\theta} x} (a_{kj})_1$;

$(a_{i(n+1)})_3 = (a_{i(n+1)})_1 + \sum_{k=1,\ldots,n;\ (a_{ik})_1 \bar{\theta} x} (a_{ik})_1$;

and all other elements are zero. The notation $\bar{\theta}$ denotes the complement
of $\theta$ (e.g. $\bar{\theta}$ is "<=" when $\theta$ is ">"). The output
is already valid, so it does not need to be processed through VALID.
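
Selection can be sketched in the same style. The name nla_select, the
COMPARATORS table and the index layout (0 for S, n+1 for F) are assumptions made
for illustration. An internal link that fails the comparison is removed, and
its weight is added to the link from S to its target and to the link from its
source to F.

```python
import operator

# Comparators allowed in the selection condition.
COMPARATORS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
               "<=": operator.le, "=": operator.eq}


def nla_select(matrix, comparator, x):
    """Keep internal links whose weight satisfies `weight comparator x`."""
    satisfies = COMPARATORS[comparator]
    size = len(matrix)
    n = size - 2
    result = [[0] * size for _ in range(size)]
    for p in range(1, n + 1):
        result[0][p] = matrix[0][p]
        result[p][n + 1] = matrix[p][n + 1]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            weight = matrix[i][j]
            if satisfies(weight, x):
                result[i][j] = weight
            else:
                result[0][j] += weight      # dropped in-link now comes from S
                result[i][n + 1] += weight  # dropped out-link now goes to F
    return result


M = [[0, 2, 1, 0, 0],
     [0, 0, 1, 1, 0],
     [0, 0, 0, 2, 0],
     [0, 0, 0, 0, 3],
     [0, 0, 0, 0, 0]]
selected = nla_select(M, ">", 1)
```

Only the link from page 2 to page 3 (weight 2) survives the condition "> 1";
the two links leaving page 1 are folded into the S and F links.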

Power. The power of one navigation matrix $M_1$, denoted as $(M_1)^x$, where $x$ is
a non-negative integer, is defined as a navigation matrix $M_3$ such that
$M_3 = 0_{(n+2) \times (n+2)}$ if $x = 0$; $M_3 = M_1$ if $x = 1$; and
$M_3 = M_1 \cdot (M_1)^{x-1}$ if $x \geq 2$. Herein, the operator $\cdot$ denotes
the multiplication of two matrices. The result $M_3$ may not be valid, since we
need to ignore the possible non-zero values on its diagonal and at
$(a_{0(n+1)})_3$. We therefore restore the balance of the pages and run
VALID. The semantics of a non-zero entry $(a_{ij})_3$ in $M_3 = (M_1)^x$ is that
there is a trail from $P_i$ to $P_j$ of length $x$ in $M_1$. If $(a_{ij})_3$ is
large, it indicates that many users have traversed from $P_i$ to $P_j$. If there
is no link from $P_i$ to $P_j$ in the Web topology, the designer may add this
link to facilitate better navigation.
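
The matrix-product part of power can be sketched in Python as below. The names
mat_mul and nla_power_raw are assumptions, and the sketch deliberately stops at
the raw product: it does not clear the diagonal or the S-to-F entry, rebalance
the pages or run VALID, all of which the full operator requires.

```python
def mat_mul(a, b):
    """Plain product of two square matrices given as lists of lists."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def nla_power_raw(matrix, x):
    """Raw x-th power of a navigation matrix, before any validity repair."""
    size = len(matrix)
    if x == 0:
        return [[0] * size for _ in range(size)]
    result = [row[:] for row in matrix]
    for _ in range(x - 1):
        result = mat_mul(matrix, result)
    return result


M = [[0, 2, 1, 0, 0],
     [0, 0, 1, 1, 0],
     [0, 0, 0, 2, 0],
     [0, 0, 0, 0, 3],
     [0, 0, 0, 0, 0]]
M2 = nla_power_raw(M, 2)
```

Here the entry for pages 1 and 3 in the squared matrix is non-zero because of
the trail of length 2 that passes through page 2.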

Grouping. The grouping of a navigation matrix $M_1$, denoted as $G_P(M_1)$,
where $P$ is a set of pages in W, returns a navigation matrix $M_3$ which is an
aggregated view of $M_1$. Grouping groups all the pages in $P$ as a single page in
$M_3$ by ignoring the links within the pages in $P$ and adding up the weight of links
from and to the outside pages, respectively. By grouping a set of pages which are
closely related in semantics, we are able to understand the navigation in terms
of information units [9]. Our approach is to view grouping as an aggregation of
log information. Therefore, we do not define ungrouping here, since ungrouping
introduces uncertainty in log information which needs further study.

The following properties follow from the definitions of the NLA operators.
We will see later that these properties pave the way for choosing an efficient
execution plan for a given WUML expression. We state the properties without
proof, since the proofs are straightforward, and assume
$\otimes_1 \in \{+, \cup, \cap\}$ and $\otimes_2 \in \{\cup, \cap\}$.

Associative Property. $(M_1 \otimes_1 M_2) \otimes_1 M_3 = M_1 \otimes_1 (M_2 \otimes_1 M_3)$.

Commutative Property. $M_1 \otimes_1 M_2 = M_2 \otimes_1 M_1$; $\Pi_P(\sigma_{\theta x}(M)) = \sigma_{\theta x}(\Pi_P(M))$.

Distributive Property. $\Pi_P(M_1 \otimes_1 M_2) = \Pi_P(M_1) \otimes_1 \Pi_P(M_2)$;
$G_P(M_1 \otimes_1 M_2) = G_P(M_1) \otimes_1 G_P(M_2)$; and
$\sigma_{\theta x}(M_1 \otimes_2 M_2) = \sigma_{\theta x}(M_1) \otimes_2 \sigma_{\theta x}(M_2)$.

4 The Web Usage Manipulation Language

We now introduce a declarative language on navigation matrices termed the
Web Usage Manipulation Language (WUML). A WUML expression is executed
via the NLA operators defined in Sect. 3, following the same principle as
translating an SQL expression into a sequence of relational algebra operations.
We now define the WUML syntax in Backus Naur Form (BNF):

<query> ::= <selectClause> <fromClause> [<conditionClause>] [<groupClause>]
<selectClause> ::= SELECT <pageList>
<queryList> ::= query [, query ...]
<fromClause> ::= FROM <matrixIdentifier> | FROM <operator> <matrixList>
<conditionClause> ::= WHERE LINKWT <compOp> integer
<groupClause> ::= GROUP BY <pageList>
<pageList> ::= pageIdentifier [, pageIdentifier ...] | *
<matrixList> ::= matrixIdentifier [, matrixIdentifier ...]
<operator> ::= SUM | UNION | DIFF | INTERSECT | POWER integer
<compOp> ::= > | <= | >= | < | =

There are four main clauses in a query expression: select, from, condition,
and group. Among them, the select and from clauses are compulsory, while
the condition and group clauses are optional. Similar to SQL, WUML is a
simple declarative language, but it is powerful enough to express queries on the
log information stored as navigation matrices.

We execute a WUML expression by translating it into a sequence of NLA
operations using Algorithm 2. Suppose $\mathcal{P} = \{P_1, P_2, \ldots, P_l\}$ is a set
of $l$ Web pages in the select clause, $\mathcal{M} = \{M_1, M_2, \ldots, M_m\}$ is a set
of $m$ matrices in the from clause, $P_G = \{P_1, P_2, \ldots, P_n\}$ is a set of $n$
Web pages in the group clause, $OPER \in \{\epsilon, SUM, UNION, DIFF, INTERSECT,
POWER\ \alpha\}$, and $x$ and $\alpha$ are two non-negative integers. Note that the
input WUML query expression is assumed to be syntactically valid. If
$OPER = \epsilon$ or $POWER\ \alpha$, then $m = 1$. Fig. 1 depicts a query tree
which illustrates the basic idea.

Algorithm 2 TRANS Algorithm
Input: A WUML expression q
LET q = <selectClause><fromClause>[<conditionClause>][<groupClause>]
    <selectClause> := "SELECT $\mathcal{P}$ | *"
    <fromClause> := "FROM OPER $\mathcal{M}$"
    <conditionClause> := "WHERE LINKWT $\theta$ $x$"
    <groupClause> := "GROUP BY $P_G$"
Procedure:
Step 1: For the fromClause, CASE OPER OF:
    $\epsilon$ : TEMP := "$M_1$"
    SUM : TEMP := "$M_1 + M_2 + \cdots + M_m$"
    UNION : TEMP := "$M_1 \cup M_2 \cup \cdots \cup M_m$"
    DIFF : TEMP := "$M_1 - M_2 - \cdots - M_m$"
    INTERSECT : TEMP := "$M_1 \cap M_2 \cap \cdots \cap M_m$"
    POWER $\alpha$ : TEMP := "$(M_1)^{\alpha}$"
Step 2: For the selectClause:
    IF "*" THEN TEMP := TEMP
    ELSE TEMP := "$\Pi_{\mathcal{P}}(TEMP)$"
Step 3: IF there is a conditionClause THEN TEMP := "$\sigma_{\theta x}(TEMP)$"
Step 4: IF there is a groupClause THEN TEMP := "$G_{P_G}(TEMP)$"
Output: TEMP expression

Fig. 1. WUML query tree

Fig. 2. Optimized WUML query tree
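
The four steps of the translation can be sketched in Python as follows. The
function name trans, its parameters and the ASCII output notation (PROJECT[...],
SELECT[...], GROUP[...]) are hypothetical choices for this illustration, and the
sketch starts from an already parsed query rather than from WUML text.

```python
def trans(pages, oper, matrices, condition=None, group=None):
    """Translate a parsed WUML query into an NLA expression string.

    pages:     list of page names, or None for "*"
    oper:      None (no operator), "SUM", "UNION", "DIFF", "INTERSECT",
               or a tuple ("POWER", alpha)
    matrices:  list of matrix names from the from clause
    condition: optional (comparator, x) pair from WHERE LINKWT
    group:     optional list of page names from GROUP BY
    """
    infix = {"SUM": " + ", "UNION": " UNION ",
             "DIFF": " - ", "INTERSECT": " INTERSECT "}

    # Step 1: the from clause fixes the innermost expression.
    if oper is None:
        temp = matrices[0]
    elif isinstance(oper, tuple) and oper[0] == "POWER":
        temp = "(" + matrices[0] + ")^" + str(oper[1])
    else:
        temp = infix[oper].join(matrices)

    # Step 2: a page list in the select clause adds a projection.
    if pages is not None:
        temp = "PROJECT[" + ", ".join(pages) + "](" + temp + ")"

    # Step 3: a condition clause adds a selection.
    if condition is not None:
        comparator, x = condition
        temp = "SELECT[" + comparator + str(x) + "](" + temp + ")"

    # Step 4: a group clause adds a grouping.
    if group is not None:
        temp = "GROUP[" + ", ".join(group) + "](" + temp + ")"
    return temp


q_sum = trans(["P1", "P2"], "SUM", ["M1", "M2"])
q_diff = trans(None, "DIFF", ["M1", "M2"], condition=(">", 3))
q_power = trans(["P1", "P2", "P3"], ("POWER", 3), ["M"])
```

The nesting order of the result mirrors the order of the steps: the from clause
is innermost and grouping is outermost.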

Now, we present a set of examples which illustrate the usage of WUML
expressions and their translation into the corresponding sequences of NLA
operations. Let $M$, $M_1$, $M_2$ and $M_3$ be navigation matrices.

Q1: We want to know how frequently the pages $P_1$ and $P_2$ were visited.

WUML expression: SELECT $P_1$, $P_2$ FROM SUM $M_1$, $M_2$.

NLA operation: $\Pi_{\{P_1, P_2\}}(M_1 + M_2)$.

Q2: We want to find out the essential difference of preferences between the two
groups of users in $M_1$ and $M_2$. We consider those links having weight > 3.

WUML expression: SELECT * FROM DIFF $M_1$, $M_2$ WHERE LINKWT > 3.

NLA operation: $\sigma_{>3}(M_1 - M_2)$.

Q3: We want to get the navigation details in an information unit [9] consisting
of the pages $P_1$, $P_2$, and $P_3$. We may gain insight to decide whether it is
better to combine these three Web pages or not. So we consider them as a
group.

WUML expression: SELECT * FROM SUM $M_1$, $M_2$ GROUP BY $P_1$, $P_2$, $P_3$.

NLA operation: $G_{\{P_1, P_2, P_3\}}(M_1 + M_2)$.

Q4: We want to know whether some pages were visited by users after 3 clicks.
If they were seldom visited or visited late in a user session, we may decide
to remove or update them to make them more popular.

WUML expression: SELECT $P_1$, $P_2$, $P_3$ FROM POWER 3 $M$.

NLA operation: $\Pi_{\{P_1, P_2, P_3\}}((M)^3)$.

Q5: Now we want to get the specific information of a particular set of Web
pages.

WUML expression: SELECT $P_1$, $P_2$, $P_3$, $P_4$, $P_5$ FROM INTERSECT $M_1$,
$M_2$, $M_3$ WHERE LINKWT > 6.

NLA operation: $\sigma_{>6}(\Pi_{\{P_1, P_2, P_3, P_4, P_5\}}(M_1 \cap M_2 \cap M_3))$.

Let us again consider the query Q5. We will see in Sect. 6 that the running
time of NLA operators is proportional to the number of non-zero elements in
the matrices they are executed on. Therefore, the optimal plan is to first execute
the NLA operators which can minimize the number of non-zero elements in the
matrix. For the sake of efficiency, the projection ($\Pi$) should be executed as
early as possible. So a better NLA execution plan for Q5 can be obtained as follows:

Q6: $\sigma_{>6}(\Pi_{\{P_1, P_2, P_3, P_4, P_5\}}(M_1) \cap \Pi_{\{P_1, P_2, P_3, P_4, P_5\}}(M_2) \cap \Pi_{\{P_1, P_2, P_3, P_4, P_5\}}(M_3))$.

We now summarize some optimization rules as depicted in Fig. 2. First,
projection should be done as early as possible, since it can eliminate some
non-zero elements. Note that projection is not distributive over difference and
power. Second, since selection is not distributive over some binary operators
such as difference, we do not change its execution order. Finally, grouping
creates a view different from the underlying Web topology. Therefore, it should
be done at the last step, unless a subsequent operator takes another navigation
matrix whose structure is the same as the grouped one. Note that these rules are
simple heuristics for ordering NLA operations. A more systematic way to generate
an optimized execution plan for a given WUML expression remains to be found.

5 Storage Schemes for Navigation Matrices 

As the navigation matrices generated from the Web log files are usually sparse, 
the storage scheme of a matrix greatly affects the performance of WUML. In 
this section we introduce three storage schemes, COO, CSR, and CSC, to study 
their impacts on the NLA operations. 

In the literature, techniques for storing sparse matrices have been intensively
studied [3,8]. In our WUML environment, we store the navigation matrix as
three separate parts: the first row (i.e. the weights of the links starting from S),
the last column (i.e. the weights of the links ending in F) and the square matrix
that remains after excluding the rows and columns of S and F. To store the first
row and the last column, we employ two vectors, $S_{vector}$ and $F_{vector}$, each
of which contains an array of the non-zero values together with their
corresponding indices. Tables 2 and 3 show examples using the matrix in Table 1.
As for the third part, we implement the storage schemes proposed in [8]. We
illustrate the schemes using the matrix in Table 1.

Table 4. COO Scheme

The Coordinate (COO) storage scheme is the most straightforward structure
to represent a sparse matrix. As illustrated in Table 4, it records each non-zero
entry together with its column and row index in three arrays in row-first order.
Similar to COO, the Compressed Sparse Row (CSR) storage scheme also consists
of three arrays. It differs from COO in the Compressed Row array, which
stores the location of the first non-zero entry in each row. Table 5 shows the
structure of CSR. The Compressed Sparse Column (CSC) storage scheme, as
shown in Table 6, is similar to CSR. It has three arrays: a Nonzero array to hold
the non-zero values in column-first order, a Compressed Column array to hold
the location of the first non-zero entry of each column, and a Row array for the
row indices. CSC is the transpose of CSR. There are also other sparse matrix
storage schemes, such as Compressed Diagonal Storage (CDS) and Jagged Diagonal
Storage (JDS) [12]. However, they are used for storing banded sparse matrices.
In practice, a navigation matrix is generally not banded. Therefore, these schemes
are not studied in our experiments.
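
The three schemes can be illustrated with a small Python sketch that converts a
dense matrix into each representation. The function names to_coo, to_csr and
to_csc, the zero-based indices and the extra final entry in the compressed
arrays (the total number of non-zero values, a common convention) are
assumptions of this example and may differ from the layout in the tables.

```python
def to_coo(dense):
    """COO: parallel arrays of values, row indices and column indices,
    listed in row-first order."""
    values, rows, cols = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                rows.append(i)
                cols.append(j)
    return values, rows, cols


def to_csr(dense):
    """CSR: values and column indices in row-first order, plus the
    position in `values` at which each row starts."""
    values, cols, row_start = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                cols.append(j)
        row_start.append(len(values))
    return values, cols, row_start


def to_csc(dense):
    """CSC: the CSR representation of the transposed matrix."""
    transposed = [list(col) for col in zip(*dense)]
    values, rows, col_start = to_csr(transposed)
    return values, rows, col_start


dense = [[0, 1, 2],
         [0, 0, 2],
         [3, 0, 0]]
coo = to_coo(dense)
csr = to_csr(dense)
csc = to_csc(dense)
```

COO stores one row index per non-zero value, whereas CSR replaces those indices
with one start position per row, which is what allows a whole row to be located
without scanning.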

6 Experimental Results and Analysis

We carry out a set of experiments to compare the performance of the three
storage schemes introduced in Sect. 5. We also study the usability and efficiency
of WUML on different data sets. The data sets we use are synthetic Web
logs on different Web topologies, generated by the log generator designed
in [10]. The parameters used to generate the log files are described in Table 7.
Among these four parameters, PageNum and MeanLink depend on the
underlying Web topology, while the other two do not. The experiments are
run on a machine with a 2.5 GHz Pentium 4 processor and 1 GB of RAM.



Table 7. Parameters for Data Set

Parameter   Description
LogSize     The number of log records in a log file.
UserNum     The number of users traversing the Web site.
PageNum     The number of pages in the Web topology.
MeanLink    The average number of links per page in the Web topology.



6.1 Construction Time of Storage Schemes 

We choose three data sets: $D_1 = (2500, 1500, 1500, 3)$, $D_2 = (5000, 3000, 3000, 5)$
and $D_3 = (10000, 6000, 6000, 10)$, in which the components represent the
parameters LogSize, UserNum, PageNum and MeanLink, respectively. Then we
construct the three storage schemes based on the log files generated for $D_1$ to $D_3$.
Our measurement of the system response time includes I/O processing time and
CPU processing time. As shown in Fig. 3, the response time grows significantly
as the parameters increase. Since most of the time is consumed in reading the
log files, the construction time for a given data set varies only slightly among
the three storage schemes. Still, it takes more time to construct COO than
the other two schemes, since there is no compressed array for COO. CSC needs
more time than CSR because the storage order in CSC is column-first while
the file is read in row-first order.

Fig. 3. Construction Time

Fig. 4. Running Four Operators

6.2 Processing Time of Binary Operators 

We present the CPU processing time results of four binary operators: sum, union, 
difference and intersection. Each time we tune one of the four parameters 
to see how the processing time changes on COO, CSR and CSC storage schemes. 
For each parameter, we carry out experiments on ten different sets of Web logs. 
We first compare the processing time of each single operator under different 
storage schemes. Then we present the processing time of each storage scheme 
under different operators. 

Tuning LogSize. We set UserNum and PageNum as 3000, MeanLink as 5. The 
results are shown in Fig. 5. When LogSize increases, the processing time of the 
same operator on each storage scheme also increases. The reason is that the 
number of non-zero elements in the navigation matrix grows with the increase 
of LogSize, and therefore it needs more time to do the operations. 

Tuning PageNum. We set LogSize as 5000, UserNum as 3000, and MeanLink 
as 5. The results are presented in Fig. 7. With the growth of PageNum, the CPU 
time for each operator on specified storage scheme grows quickly. It is because 
PageNum is a significant parameter when constructing the navigation matrix. 
The more pages in the Web site, the larger dimension of a navigation matrix, 
and consequently, the more time needed to construct the navigation matrix. 

Tuning UserNum. Figure 6 shows the results when LogSize is 5000, PageNum
is 3000 and MeanLink is 5. The processing time remains almost unchanged
when UserNum grows. The main reason is that, although different users may
behave differently when traversing the Web site, the number of non-zero
elements in the navigation matrix is roughly the same due to the fixed LogSize.

Fig. 5. The CPU time by tuning LogSize



Tuning MeanLink. We use the log files with LogSize of 5000, UserNum of 3000 
and PageNum of 3000. The results are shown in Fig. 8 which indicates that, with 
the increase of MeanLink, the processing time decreases. 

Note that for sum, COO always outperforms the others, while CSR and CSC
perform almost the same (see Fig. 5(a), 6(a), 7(a) and 8(a)). A similar
phenomenon can be observed in Fig. 5(d), 6(d), 7(d) and 8(d) for intersection.
As shown in Fig. 5(b), 6(b), 7(b) and 8(b), the processing times for union on
the three storage schemes show no significant difference. Finally, from Fig. 5(c),
6(c), 7(c) and 8(c), CSR and CSC perform much better than COO for
difference. Note also that, as shown in Fig. 4, difference requires the most
processing time and sum the least. The Web logs used have a LogSize of 5000,
a UserNum of 1000, a PageNum of 3000 and a MeanLink of 5. The reason for this
result is as follows. As we have mentioned, we do not need to check the balance
of Web pages or the validity of the navigation matrix for sum. Therefore, it takes
the least time. For union, we only need to check the balance of Web pages,
without checking the validity of the output matrix. But for difference and
intersection, we have to check both the page balance and the matrix validity,
which is rather time-consuming. Nevertheless, intersection does not need much
time, since there are very few non-zero elements in the output matrix.

Fig. 6. The CPU time by tuning UserNum



6.3 Performance of Unary Operators 

Power. We use log files with a LogSize of 5000, a UserNum of 3000, a MeanLink
of 5 and PageNum values of 100, 500 and 1000. Each matrix is raised to the power
of 2. As shown in Fig. 9, COO performs much worse than CSR and CSC. We also see
that power is a rather time-consuming operator.

Projection and Selection. Since projection and selection are commuta- 
tive, we study the time cost by swapping them on the navigation matrix with 
5000 LogSize, 5000 PageNum, 3000 UserNum and 5 MeanLink. As shown in Fig. 
10, doing projection before selection is more efficient than doing selection 
and then projection. According to this result, we can do some optimization 
when interpreting some queries. Moreover, COO outperforms CSR and CSC. 

From the experimental results shown above, we make the following observations.
First, from the construction point of view, CSR is the best. Second, COO is
the best for sum and intersection. Third, CSR and CSC perform almost the
same for difference and power, and greatly outperform COO. Finally, COO,
CSR and CSC perform the same for union. Taking these observations into
consideration, CSR is the best choice for our WUML expressions. Although COO
performs better for sum and intersection, the time it needs for difference is
intolerable. Although CSC performs the same as CSR with respect
to the operations, CSC needs more time to be constructed. We also observe that
the time growth for each operator is linear in the growth of the parameters, which
indicates that the usability and scalability of WUML are acceptable in practice.

Fig. 7. The CPU time by tuning PageNum



7 Concluding Remarks 

We presented NLA, which consists of a set of operators on navigation matrices,
and proposed an efficient algorithm, VALID ($O(n)$ space and $O(n^2)$ time
complexity), to ensure the validity of the output matrix of NLA operators. On
top of NLA, we developed a query language, WUML, and studied the mapping between
WUML statements and NLA expressions. To choose an efficient storage scheme
for the sparse navigation matrix, we carried out a set of experiments on different
synthetic Web log data sets, which were generated by tuning parameters
such as the number of pages, the mean number of links and the number of users.
The experimental results on the three storage schemes COO, CSC and CSR show
that the CSR scheme is relatively efficient for NLA. As for future work,
we plan to develop a full-fledged WUML system to perform both analysis and
mining of real Web log data sets. We are also studying a more complete set of
optimization heuristics for the NLA operators in order to generate a better
execution plan for an input WUML expression.

Fig. 8. The CPU time by tuning MeanLink

Fig. 9. Power (in log scale)

Fig. 10. Projection and Selection

Abstract. The emerging paradigm of web services promises to bring
to distributed computing the same flexibility that the web has brought 
to the publication and search of information contained in documents. 
This new paradigm puts severe demands on composition and execution 
of workflows that must survive and respond to changes in the computing 
and business environments. Workflows facilitated by web services must, 
therefore, allow dynamic composition in ways that cannot be predicted 
in advance. Utilizing the notions of shared mental models and proactive 
information exchange in agent teamwork research, we propose a solution 
that interleaves planning and execution in a distributed manner. This pa- 
per proposes a generic model, gives the mappings of terminology between 
Web services and team-based agents, describes a comprehensive archi- 
tecture for realizing the approach, and demonstrates its usefulness with 
the help of an example. A key benefit of the approach is the proactive
handling of failures that may be encountered during the execution of complex
web services.



1 Introduction 

The mandate for effective composition of web services comes from the need to
support complex business processes. Web services allow a more granular
specification of tasks contained in workflows, and suggest the possibility of gracefully
accommodating short-term trading relationships, which can be as brief as a single
business transaction [1]. Facilitating such workflows requires dynamic
composition of complex web services that must be monitored for successful execution.
Drawing from research in workflow management systems [2], the realization of
complex web services can be characterized by the following elements: (a) creating
the execution order of operations from the short-listed Web services; (b) enacting
the execution of the services in the sequenced order; and (c) administering
and monitoring the execution process. The web services composition problem
has, therefore, been recognized to include both coordinating the sequence of
service executions and managing the execution of services as a unit [3].

Much current work in web service composition continues to focus on the first
ingredient, i.e. discovery of appropriate services and planning for the sequencing
and invocation of these web services [4]. Effective web services composition,
however, must encompass concerns beyond the planning stage, including the ability
to handle errors and exceptions that may arise in a distributed environment.
Monitoring of the execution and exception handling for the web services must,
therefore, be part of an effective strategy for web service composition [5].

An approach to realizing this strategy is to incorporate research from intelli- 
gent agents, in particular, team-based agents [6]. The natural match between web 
services and intelligent agents - as modularized intelligence - has been alluded 
to by several researchers [7]. The objective of the research reported in this paper 
is to go a step further: to develop an approach for interleaved composition and 
execution of web services by incorporating ideas from research on team-based 
agents. In particular, an agent architecture will be proposed such that agents 
can respond to environmental changes and adjust their behaviors proactively to
achieve better quality of service (QoS).

2 Prior Work 

2.1 Composition of Web Services 

Web services are loosely coupled, dynamically locatable software, which provides 
a common platform-independent framework that simplifies heterogeneous appli- 
cation integration. Web services use a service-oriented architecture (SOA) that 
communicates over the web using XML messages. The standard technologies for 
implementing the SOA operations include Web Services Description Language 
(WSDL), Universal Description, Discovery and Integration (UDDI), Simple Ob- 
ject Access Protocol (SOAP) [8], and Business Process Execution Language for 
Web Services (BPEL4WS). Such function-oriented approaches have provided 
guidelines for planning of web service compositions [4,9]. However, the tech- 
nology to compose web services has not kept pace with the rapid growth and 
volatility of available opportunities [10]. While the composition of web services 
requires considerable effort, its benefit can be short-lived and may only sup- 
port short-term partnerships that are formed during execution and disbanded 
on completion [10]. 

Web services composition can be conceived as a two-phase procedure, involving
planning and execution [11]. The planning phase includes determining the series
of operations needed to accomplish the desired goals from a user query,
customizing services, scheduling the execution of composed services, and constructing
a concrete and unambiguously defined composition of services ready to be
executed. The execution phase involves the process of collaborating with other services
to attain the desired goals of the composed services. The overall process has
been classified along several dimensions. The dimension most relevant for our
discussion is pre-compiled vs. dynamic composition [12]. Compared with the
pre-compilation approach, dynamic compositions can better exploit the present 
state of services, provide runtime optimizations, as well as respond to changes 
in the business environment. On the other hand, dynamic composition of
web services is a particularly difficult problem because of the continued need to
provide high availability, reliability, and scalability in the face of the high degrees
of autonomy and heterogeneity with which services are deployed and managed
on the web [3]. The use of intelligent agents has been suggested to handle these
challenges.



2.2 Intelligent Agents for Web Service Composition 

There is increasing recognition that web services and intelligent agents represent 
a natural match. It has been argued that both represent a form of “modularized 
intelligence” [7]. The analogy has been carried further to articulate the ultimate 
challenge as the creation of effective frameworks, standards and software for au- 
tomating web service discovery, execution, composition and interoperation on 
the web [13]. Following the discussion of web service composition above, the role 
of intelligent agents may be identified as on-demand planning, and proactively re- 
sponding to changes in the environment. In particular, planning techniques have 
been applied to web services composition. Kay [14] describes the ATL Postmaster 
system that uses agent-based collaboration for service composition. A drawback 
of the system is that the ATL Postmaster is not fault-tolerant. If a node fails,
the agents residing in it are destroyed and state information is lost. Maamar
et al. [15] propose a framework based on software agents for web services
composition, but fail to tie their framework to web services standards. It is not 
clear how their framework will function with BPEL4WS and other web services 
standards and handle exceptions. Srivastava and Koehler [4], while discussing the
use of planning approaches to web services composition, indicate that planning alone
is not sufficient; useful solutions must also consider failure handling as well as
composition with multiple partners. Effective web service composition, thus, re- 
quires expertise regarding available services, as well as process decomposition 
knowledge. A particular flavor of intelligent agents, called team-based agents, 
allows expertise to be distributed, making them a more appropriate fit for web 
services composition. 

Team-based agents are a special kind of intelligent agent with distributed
expertise (knowledge); they emphasize cooperativeness and proactiveness in
pursuing their common goals. Several computational models of teamwork have been
proposed, including GRATE* [16], STEAM [17] and CAST [6]. These models
allow multiple agents to solve complex problems (e.g., planning, task execution)
collaboratively. In web services composition, team-based agents can facilitate
a distributed approach to dynamic composition, which can be scalable, facili- 
tate learning about specific types of services across multiple compositions, and 
allow proactive failure handling. In particular, the CAST architecture (Collabo- 
rative Agents for Simulating Teamwork) [6] offers a feasible solution for dynamic 
web services composition. Two key features of CAST are (1) CAST agents can 
work collaboratively using a shared mental model of the changing environment; 
(2) CAST agents proactively inform each other of changes that they perceive 
to handle any exceptions that arise in achieving a team goal. By collaboratively 
monitoring the progress of a shared process, a team of CAST agents can not only
initiate helping behaviors proactively but can also adjust their own behaviors to
the dynamically changing environment. 

In the rest of this paper, we first propose a generic team-based agent frame- 
work for dynamic web-service composition, and then extend the existing CAST 
agent architecture to realize the framework. 

3 A Methodology for Interleaved Composition and Execution

We illustrate the proposed methodology with the help of an example that demon- 
strates how team-based agents may help with dynamic web services composition. 

The example concerns dynamic service outsourcing in a virtual software de- 
velopment organization, called ‘VOSoft’. VOSoft possesses expertise in designing 
and developing software packages for customers from a diversity of domains. It 
usually employs one of two development methodologies (or software processes):
the prototype-based approach (Mp) is preferred for software systems composed of
tightly-coupled modules (integration problems are revealed earlier), and the unit-based
approach (Mu) is preferred for software systems composed of loosely-coupled
modules (more efficient due to parallel tasks).

Suppose a customer “WSClient” engages VOSoft to develop CAD design
software for metal casting patterns. It is required that the software be able to
(a) read AutoCAD drawings automatically, (b) develop designs for metal casting
patterns, and (c) maintain all the designs and user details in a database.
Based on its expertise, VOSoft designs the software as being composed of
three modules: database management system (DMS), CAD, and pattern design.
Assume VOSoft’s core competency is developing the application logic that is re- 
quired for designing metal casting patterns, but it cannot develop CAD software 
and the database module. Hence, VOSoft needs to compose a process where the 
DMS and CAD modules could be outsourced to competent service providers. 

In this scenario, several possible exceptions may be envisioned. We list three 
below to illustrate the nature and source of these exceptions. First, non-per- 
formance by a service provider will result in a service failure exception, which 
may be resolved by locating another service to perform the task. Second, module 
integration exceptions may be raised if two modules cannot interact with each 
other. This may be resolved by adding tasks to develop common APIs for the 
two modules. Third, the customer may change or add new functionality, which 
may necessitate the change of the entire process. 

It is clear that both internal (capability of web services) as well as external 
(objectives being pursued) changes can influence the planning and execution of 
composite web services in such scenarios. An approach is therefore required that can
monitor service execution and proactively handle service failures.

3.1 Composing with Team-Based Agents 

A team-based agent $A$ is defined in terms of (a) a set of capabilities (service
names), denoted as $C_A$, (b) a list of service providers $SP_A$ under its management,
and (c) an acquaintance model $M_A$ (a set of agents known to $A$, and their
respective capabilities: $M_A = \{(i, C_i)\}$).

An agent in our framework may play multiple roles. First, every agent is a
Web-service manager. An agent $A$ knows which providers in $SP_A$ can offer a
service $S$ ($S \in C_A$), or at least knows how to find a provider for $S$ (e.g., by
searching the UDDI registry) if none of the providers in $SP_A$ is capable of
performing the service. Services in $C_A$ are primitive to agent $A$ in the sense
that it can directly delegate the services to appropriate service providers. Second,
an agent becomes a service composer when a complex service is requested of it.
An agent is responsible for composing a process using known services
when it receives a user request that falls beyond its capabilities. In such situations,
the set of acquaintances, $M_A$, forms a community of contacts available to
agent $A$. The acquaintance model is dynamically modified based on the agent’s
collaboration with other agents (e.g., assigning credit to those with successful
collaborations). This additional, local knowledge supplements the global knowledge
about publicly advertised web services (say, on the UDDI registry). Third,
an agent becomes a team leader when it tries to form a team to honor a
complex service.
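
As an illustration only, the agent definition above can be sketched as a small data structure. The class name `TeamAgent`, the `credit` counter, and the dictionary standing in for a UDDI registry are assumptions made for this sketch; they are not part of the CAST architecture as described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class TeamAgent:
    """Hypothetical model of a team-based agent A = (C_A, SP_A, M_A)."""
    name: str
    capabilities: Set[str] = field(default_factory=set)              # C_A: service names
    providers: Dict[str, List[str]] = field(default_factory=dict)    # SP_A: service -> providers
    acquaintances: Dict[str, Set[str]] = field(default_factory=dict)  # M_A: agent -> capabilities
    credit: Dict[str, int] = field(default_factory=dict)             # assumed credit bookkeeping

    def find_provider(self, service: str, registry: Dict[str, List[str]]) -> Optional[str]:
        """Service-manager role: prefer a managed provider, else search the registry."""
        if service not in self.capabilities:
            return None
        managed = self.providers.get(service, [])
        if managed:
            return managed[0]
        found = registry.get(service, [])   # stand-in for a UDDI lookup
        return found[0] if found else None

    def record_collaboration(self, other: "TeamAgent", success: bool) -> None:
        """Update the acquaintance model after working with another agent."""
        self.acquaintances.setdefault(other.name, set()).update(other.capabilities)
        if success:
            self.credit[other.name] = self.credit.get(other.name, 0) + 1
```

How credit should influence later teammate selection is left open in the text, so the sketch only records it.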




Fig. 1. Team formation and Collaboration 



3.2 Responding to Request for a Complex Service 

An agent, upon receiving a complex service request, initiates a team formation 
process: 

(1) The agent (say, $C$) adopts “offering service $S$” as its persistent goal.

(2) If $S \in C_C$ (i.e., $S$ is within its capabilities), agent $C$ simply delegates $S$ to a
competent provider (or first finds a service provider, if no provider known to $C$
is competent).

(3) If $S \notin C_C$ (i.e., agent $C$ cannot directly serve $S$), then $C$ tries to compose
a process (say, $H$) using its expertise and the services in $C_C \cup \bigcup_{(i, C_i) \in M_C} C_i$
(i.e., it considers its own capabilities and the capabilities of those agents in its
acquaintance model), then starts to form a team:

i. Agent $C$ identifies teammates by examining agents in its acquaintance model
who have the capability to contribute to the process (i.e., $A \in M_C$ and
$S_H \cap C_A \neq \emptyset$, where $S_H$ is the set of services used in process $H$).

ii. Agent $C$ chooses willing and competent agents from $M_C$ (e.g., using the
contract-net protocol [18]) as teammates, and shares the process $H$ with them with
a view to working together as a team on $H$.

(4) If the previous step fails, then agent $C$ either fails in honoring the external
request (and is penalized), or, if possible, may proactively discover a different agent
(using either its acquaintance model or UDDI) and delegate $S$ to it.
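
The four steps above can be sketched as a single function. This is a minimal sketch under stated assumptions: `compose` is a hypothetical planner that returns the set $S_H$ of services used by a process (or `None`), `is_willing` stands in for a contract-net style negotiation, and the fallback in step (4) of discovering another agent is reduced to reporting failure.

```python
def respond_to_request(service, own_caps, acquaintances, compose, is_willing):
    """Sketch of steps (1)-(4) for an agent C.

    own_caps      -- C_C, the services C can delegate directly
    acquaintances -- M_C as a dict: agent name -> set of capabilities C_i
    compose       -- hypothetical planner: (service, known services) -> S_H or None
    is_willing    -- hypothetical stand-in for contract-net negotiation
    """
    # Step (2): S is primitive for C, so it is delegated to a provider.
    if service in own_caps:
        return ("delegate", service)

    # Step (3): compose a process H from C's and its acquaintances' capabilities.
    known = set(own_caps).union(*acquaintances.values())
    needed = compose(service, known)
    if needed:
        # (i) candidates are acquaintances whose capabilities intersect S_H.
        candidates = [a for a, caps in acquaintances.items() if caps & needed]
        # (ii) keep the willing candidates as teammates.
        team = [a for a in candidates if is_willing(a)]
        covered = set(own_caps).union(*(acquaintances[a] for a in team))
        if needed <= covered:
            return ("team", sorted(team))

    # Step (4): composition or team formation failed.
    return ("fail", service)
```

Requiring that the team jointly cover every service in $S_H$ is a design choice of this sketch; the text only requires that each teammate can contribute to the process.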

[Example] Suppose agent VOSoft composes a top-level process as shown in
Figure 1(a). In the process, the “contract” service is followed by a choice point,
where VOSoft needs to decide which methodology (Mp or Mu)
to choose. If Mu is chosen, then services DMS-WS, CAD-WS and Pattern-WS
are required; if Mp is chosen, then the services need to be refined further so that
interactions between service providers in the software development process can
be carried out frequently to avoid potential integration problems at later stages.
Now, suppose VOSoft chooses branch Mu and manages to form a team including
agents T1, T3 and VOSoft to collaboratively satisfy the user’s request. It is
possible that agent T4 was asked but declined to join the team for some reason
(e.g., lack of interest or capacity). After the team is formed, each agent’s
responsibility is determined and mutually known. As the team leader, agent VOSoft
is responsible for coordinating the others’ behavior to work towards their common
goal, and for making decisions at critical points (e.g., adjusting the process if a service
fails). Agent T1 is responsible for service DMS-WS, and agent T3 is responsible
for service CAD-WS. As service managers, T1 and T3 are responsible for
choosing an appropriate service provider for services DMS-WS and CAD-WS,
respectively.



3.3 Collaborating in Service Execution 

Sharing a high-level process enables agents in a team to perform proactive
teamwork behaviors during service execution.

Proactive Service Discovery: Knowing the joint responsibility of the team and
the individual responsibilities of team members, one agent can help another find web
services. For example, in Figure 1(b), agent T1 is responsible for contributing
service D-design. Agent T3, who happens to identify a service provider for
service D-design while interacting with the external world, can proactively inform
T1 about the provider. This not only improves T1’s competency regarding
service D-design, but also enhances T3’s credibility in T1’s acquaintance
model.

Proactive Service Delegation: An agent can proactively contract out services to
competent teammates. For example, suppose branch Mu is selected and service
CAD-WS is a complex service for T3, who has composed a process for CAD-WS
as shown in Figure 1(b). Even though T3 can perform C-design and C-code,
services C-test and C-debug are beyond its capabilities. In order to provide the
committed service CAD-WS, T3 can proactively form another team and delegate
the services to the recruited agents (i.e., T6). It might be argued that agent
VOSoft could have generated a high-level process with a more detailed decomposition;
say, the sub-process generated by T3 for CAD-WS could be embedded (in
the place of CAD-WS) as a part of the process generated by VOSoft. If so, T6
would have been recruited as VOSoft’s teammate, and no delegation would be
needed. However, the ability to derive a process at all decomposition levels is
too stringent a requirement to place on any single agent. One benefit of using
agent teams is that one agent can leverage the knowledge (expertise) distributed
among team members even though each of them has only limited resources.

Proactive Information Delivery: Proactive information delivery can occur in various
situations. (i) A complex process may have critical choice points where
several branches are specified, but which one will be selected depends on the
known state of the external environment. Thus, teammates can proactively inform
the team leader about those changes in state that are relevant to its
decision-making. (ii) Upon making a decision, the leader informs the other teammates
of the decision so that they can better anticipate potential collaboration needs.
(iii) A web service (say, the service Test in branch Mu) may fail for many
reasons. The responsible agent can proactively report the service failure to the
leader so that the leader can decide how to respond to the failure: choose an
alternative branch (say, Mp), or request the responsible agent to re-attempt the
service with another provider.



4 The CAST-WS Architecture 

We have designed a team-based agent architecture CAST-WS (Collaborative 
Agents for Simulating Teamwork among Web Services) to realize our methodol- 
ogy (see Figure 2). In the following, we describe the components of CAST-WS 
and explain their relationships. 

Fig. 2. The CAST-WS Architecture

4.1 Core Representation Decisions

The core representation decisions that drive the architecture involve mapping
concepts from team-based agents to composition and execution of complex web
services with an underlying representation that may be common to both domains.
Such a representation is found in Petri nets [19]. The CAST architecture
utilizes hierarchical predicate-transition nets as the underlying representation for
specifying plans created and shared among agents. In the web service domain,
the dominant standard for specifying compositions, BPEL4WS, can also be
interpreted based on a broad interpretation of the Petri net formalism. Another
key building block for realizing complex web services, protocols for conversations
among web services [20], uses state-based representations that may be mapped
to Petri-net based models for specifying conversation states and their evolution.
As a conceptual model, therefore, a control-oriented representation of workflows,
complex web services and conversations can share the Petri-net structure, with
the semantics provided by each of the domains. The mapping between team-based
agents and complex web services is summarized in Table 1 below.

Table 1. Mapping between Team-based Agents and Complex Web Services

Team-based Agents        Complex Web Services        Representation
---------------------    ------------------------    ---------------
Agent                    Service provider            entity
Plan                     Process specification       MALLET, BPEL4WS
Goal                     Requests/tasks              predicates
Agent capability         Services                    WSDL
Agent interaction        Conversations               Petri net
Process knowledge        No corresponding concept    Petri net
Environment knowledge    No corresponding concept    Horn clauses



Following this mapping, we have devised the following components of the
architecture: service planning (i.e., composing complex web services), team
coordination (i.e., knowledge sharing among web services), and execution (i.e., realizing
the execution of complex web services).

4.2 WS-Planning Component 

The Planning component is responsible for composing services and forming
teams. This component includes three modules. The service discovery module is
used by the service planner to look up required services in the UDDI registry. The
team formation module, together with the acquaintance model, is used to find team
agents who can support the required services. A web service composition starts
from a user’s request. The agent that receives the request is the composer agent, which
is in charge of fulfilling the request. Upon receiving a request, the composer
agent turns the request into its persistent goal and invokes its service planner
module to generate a business process for it.

The architecture uses hierarchical predicate-transition nets (PrT nets) to
represent and monitor business processes. A PrT net consists of the following
components: (1) a set $P$ of token places for controlling the process execution;
(2) a set $T$ of transitions, each of which represents either an abstraction of a sub PrT
net (i.e., an invocation of some sub-plan) or an operation (e.g., a primitive web
service); a transition is associated with preconditions (predicates), which are used
to specify conditions for continuing the process; (3) a set of arcs over $P \times T$ that
describes the order of execution that the team will follow; and (4) a labeling
function on arcs, whose labels are tuples of agents and bindings for variables.
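
To make the four components concrete, the following is a minimal sketch of such a net as a data structure. The names `Transition`, `PrTNet`, `enabled` and `fire` are assumptions for illustration, arcs are stored in both directions (place to transition and transition to place), and each place holds at most one token; none of this is claimed to match the CAST implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set, Tuple

Label = Tuple[Tuple[str, ...], Dict[str, str]]   # (agents, variable bindings)


@dataclass
class Transition:
    name: str
    precondition: Optional[Callable[[dict], bool]] = None  # predicate over the environment
    subnet: Optional["PrTNet"] = None                       # set if this abstracts a sub-net


@dataclass
class PrTNet:
    places: Set[str] = field(default_factory=set)
    transitions: Dict[str, Transition] = field(default_factory=dict)
    arcs: Dict[Tuple[str, str], Label] = field(default_factory=dict)  # (source, target) -> label
    marking: Set[str] = field(default_factory=set)                    # places holding a token

    def enabled(self, name: str, env: dict) -> bool:
        """A transition is enabled if all input places are marked and its precondition holds."""
        inputs = [s for (s, t) in self.arcs if t == name]
        pre = self.transitions[name].precondition
        return all(p in self.marking for p in inputs) and (pre is None or pre(env))

    def fire(self, name: str, env: dict) -> None:
        """Consume tokens from input places and produce tokens in output places."""
        if not self.enabled(name, env):
            raise ValueError(f"transition {name!r} is not enabled")
        self.marking -= {s for (s, t) in self.arcs if t == name}
        self.marking |= {t for (s, t) in self.arcs if s == name}
```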

The services that are used by the service planner for composing a process
come from two sources: the UDDI directory and the acquaintance model. Assume
that from the requested service we can derive a set of expected effects, which will be
the goals to be achieved by CAST agents. Given any set of goals $G$, a partial
order (binary relation) $\le$ can be defined over $G$: $\forall g_1 \in G$, $g_2 \in G$, $g_3 \in G$,

(1) $g_1 \le g_1$;

(2) $g_1 \le g_2,\ g_2 \le g_1 \Rightarrow g_1 = g_2$;

(3) $g_1 \le g_2,\ g_2 \le g_3 \Rightarrow g_1 \le g_3$.

Given $(G, \le)$, $\forall g \in G$, its pre-set, denoted as ${}^{\bullet}g$, is defined as
$\{g' \in G \mid g' \le g \text{ and } \nexists g'' \ne g \text{ such that } g' \le g'' \le g\}$; its post-set,
denoted as $g^{\bullet}$, is defined as
$\{g' \in G \mid g \le g' \text{ and } \nexists g'' \ne g \text{ such that } g \le g'' \le g'\}$. Given $(G, \le)$, any
$G_1 \subseteq G$, $G_2 \subseteq G$, $G_1$ and $G_2$ are independent iff $\forall g \in G_1$, $\nexists g' \in G_2$ such
that $g \le g'$ or $g' \le g$, and vice versa. Element $g \in G$ is indetachable from $G$ iff
$\exists g' \in G$ such that $g \le g'$ or $g' \le g$.
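
The definitions above can be illustrated with a short sketch. It assumes that the order is given as a list of direct dependencies and closed reflexively and transitively by a helper (`make_leq`, a hypothetical name). It also reads pre-set and post-set as the immediate predecessors and successors of a goal, so an intermediate goal must differ from both endpoints; this is an interpretation of the definition rather than a literal transcription.

```python
def make_leq(goals, pairs):
    """Return a predicate leq(x, y) for the reflexive-transitive closure of `pairs`."""
    reach = {g: {g} for g in goals}          # reach[g] = all goals h with g <= h
    changed = True
    while changed:
        changed = False
        for a, b in pairs:                   # each pair (a, b) states a <= b
            for g in goals:
                if a in reach[g] and b not in reach[g]:
                    reach[g].add(b)
                    changed = True
    return lambda x, y: y in reach[x]


def pre_set(g, goals, leq):
    """Goals immediately below g: nothing else lies strictly between them and g."""
    return {x for x in goals if x != g and leq(x, g)
            and not any(y != x and y != g and leq(x, y) and leq(y, g) for y in goals)}


def post_set(g, goals, leq):
    """Goals immediately above g."""
    return {x for x in goals if x != g and leq(g, x)
            and not any(y != x and y != g and leq(g, y) and leq(y, x) for y in goals)}


def independent(g1, g2, leq):
    """Two goal sets are independent if no goal of one is comparable to a goal of the other."""
    return not any(leq(a, b) or leq(b, a) for a in g1 for b in g2)
```

For instance, with the dependencies C-design before C-code and C-code before C-test, the pre-set of C-test contains only C-code, and a separate goal D-design is independent of all three.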

The following algorithm is used by the service planner to generate a Petri-net 
process for a given goal (service request). 

Algorithm: ServicePlanning(Goal g)

1. Let $G = \{g\}$

2. If $g$ can be divided into $\{g_1, g_2, \ldots, g_k\}$ then let $G = \{g_1, g_2, \ldots, g_k\}$

3. Partition $G$ into $G_1, G_2, \ldots, G_m$ such that they are pairwise
   independent, and any goal in $G_i$ ($1 \le i \le m$) is indetachable from $G_i$.
   Since each $G_i$ is an ordered set, we denote them as $(G_i, \le)$.

4. Create a PrT net PN with a parallel construct where $G_i$ ($1 \le i \le m$)
   are the branches;

5. For i from 1 to m do
      If (expandFurther is True) then
         sub-net = Expanding($G_i$, $\le$)
         Replace $G_i$ in PN with sub-net

6. return PN.

Algorithm: Expanding(GoalSet G, Order $\le$)

1. Create a net SN for G based on the order information:

   a. If $g_2$ depends on $g_1$, use sequential construct SEQ;

   b. If the choice between $g_1$ and $g_2$ depends on the truth value of some
      condition, use conditional construct IF;

   c. If a goal (or sequence of goals) needs to be done repeatedly, use
      iterative construct WHILE;

   d. If two or more goals can be done in parallel,
      use parallel construct PAR;

   e. Make sure that for any $g \in G$, there is a token place between $g$ and
      any goal in $g^{\bullet}$.

2. For any g in net SN

   a. If a service can achieve g then replace g with the service (name);

   b. If a plan (process) can achieve g then replace g with the plan;

   c. If more than one plan (process) can achieve g then
      replace g with a choice construct CHOICE;

   d. Otherwise, if (expandFurther is True) then ServicePlanning(g);

3. return SN.
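
The partitioning and parallel-construct steps of ServicePlanning can be sketched as follows. This is a simplified illustration, not the algorithm itself: independent groups are computed as connected components of the direct-dependency relation, each group is linearized into a single SEQ using the standard-library `graphlib` module (Python 3.9 or later), and the IF, WHILE and CHOICE constructs of Expanding are omitted. The nested-tuple plan format is an assumption of this sketch.

```python
from graphlib import TopologicalSorter


def partition(goals, depends_on):
    """Group goals into connected components of the dependency relation.

    depends_on maps a goal to the set of goals it directly depends on.
    Different components are pairwise independent.
    """
    neighbours = {g: set() for g in goals}
    for g, preds in depends_on.items():
        for p in preds:
            neighbours[g].add(p)
            neighbours[p].add(g)
    groups, seen = [], set()
    for g in goals:
        if g in seen:
            continue
        stack, group = [g], set()
        while stack:                       # depth-first walk of one component
            cur = stack.pop()
            if cur not in group:
                group.add(cur)
                stack.extend(neighbours[cur] - group)
        seen |= group
        groups.append(group)
    return groups


def service_planning(goals, depends_on):
    """Return a nested plan: PAR over independent groups, SEQ inside each group."""
    branches = []
    for group in partition(goals, depends_on):
        graph = {g: depends_on.get(g, set()) & group for g in group}
        branches.append(("SEQ", list(TopologicalSorter(graph).static_order())))
    return ("PAR", branches)
```

Linearizing a group with a topological sort is only faithful when the group is a chain; goals within a group that could run in parallel would need a nested PAR instead.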

4.3 The Team Coordination Component 

The team coordination component is used to coordinate with other agents during 
service execution. This component includes an inference engine with a built-in 
knowledge base, a process shared by all team members, a PrT interpreter, a 
plan adjustor, and an inter-agent coordination module. The knowledge base holds
the (accumulated) expertise needed for service composition. The inter-agent
coordination module, embedded with team coordination strategies and conversation
policies [21], is used for behavior collaboration among teammates. Here we
mainly focus on the PrT interpreter and the plan adjustor.

Each agent in a team uses its PrT net interpreter to interpret the business
process generated by its service planner, monitor the progress of the shared process,
and take its turn to perform tasks assigned to it. If the assigned task is
a primitive web service, the agent invokes the service through its BPEL4WS
process controller. If a task is assigned to multiple agents, the responsible agents
coordinate their behavior (e.g., so as not to compete for common resources) through the
inter-agent coordination module. If an agent faces an unassigned task, it evaluates
the constraints associated with the task and tries to find a competent teammate
for the task. If the assigned task is a complex service (i.e., further decomposition
is required) and is beyond its capabilities, the agent treats it as an internal request,
composes a sub-process for the task, and forms another team to solve it.

The plan adjustor uses the knowledge base and inference engine to adjust
and repair the process whenever an exception or a need for change in the process
arises. The algorithm used by the plan adjustor utilizes the failure handling
policy implemented in CAST. Due to the hierarchical organization of the team
process, each CAST agent maintains a stack of active processes and sub-processes.
A sub-process returns control to its parent process when its execution is completed.
Failure handling is interleaved with (abstract) service execution: execute
a service; check termination conditions; handle failures; and propagate failures to
the parent process if needed. The algorithm captures four kinds of termination
modes resulting from a service execution. The first (i.e., return 0) indicates that
the service is completed successfully. The second (i.e., return 1) indicates that the
process is terminated abnormally but the expected effects of the service have
already been achieved “magically” (e.g., by proactive help from teammates). The
third (i.e., return 2) indicates that the process is not completed and is likely at
an impasse. In this case, if the current service is just one alternative of a choice
point, another alternative can be selected to re-attempt the service. Otherwise,
the failure is propagated to the upper level. The fourth (i.e., return 3) indicates
that the process is terminated because the service has become irrelevant. This
may happen if the goal or context changes. In this case, the irrelevance is propagated
to the parent service, which checks its own relevance. The plan adjustor
algorithm is shown below.

Algorithm: ServiceExecution(Level i, Service S)

1. Let P be the process (plan) for S

2. SS = getNextService(P)

3. While SS is not null   /* P is not completed */
      terminateCode = ServiceExecution(i + 1, SS)
      if terminateCode = 0 or terminateCode = 1
         SS = getNextService(P)
      if terminateCode = 2
         if SS is one branch of a choice point C
            SS = chooseAnotherWay(P, C)
         else return 2
      if terminateCode = 3
         if (the execution of S is irrelevant) return 3
         else SS = getNextService(P)

4. end while

5. return 0
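
A runnable rendering of this control flow is sketched below. Several details are assumptions made to obtain a self-contained example: a process is modelled as a list of steps, each step being a non-empty list of alternative services (more than one alternative marks a choice point); primitive services carry a `run` callable that returns a termination code directly; and a successful sub-service (code 0 or 1) advances the process to the next step.

```python
SUCCESS, ACHIEVED, IMPASSE, IRRELEVANT = 0, 1, 2, 3


class Service:
    """Hypothetical service node: primitive if `run` is given, composite otherwise."""

    def __init__(self, name, steps=None, run=None, still_relevant=None):
        self.name = name
        self.steps = steps or []              # each step: list of alternative Services
        self.run = run                        # primitive only: callable returning a code
        self.still_relevant = still_relevant or (lambda: True)


def execute(service, level=0, log=None):
    """Execute a service, handling failures level by level, and return a termination code."""
    if service.run is not None:               # primitive service: invoke it
        code = service.run()
        if log is not None:
            log.append((level, service.name, code))
        return code
    for alternatives in service.steps:
        remaining = list(alternatives)
        while True:
            code = execute(remaining.pop(0), level + 1, log)
            if code in (SUCCESS, ACHIEVED):
                break                          # move on to the next step
            if code == IMPASSE:
                if remaining:
                    continue                   # try another branch of the choice point
                return IMPASSE                 # propagate the failure upwards
            if code == IRRELEVANT:
                if not service.still_relevant():
                    return IRRELEVANT          # the parent is irrelevant as well
                break                          # skip the sub-service and continue
    return SUCCESS
```

In this form an impasse in one branch is absorbed locally when another alternative exists, and reaches the parent only when all alternatives at that choice point are exhausted.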

4.4 The WS-Execution Component 

A service manager agent executes the primitive services (or a process of primitive 
services) through the WS-Execution component. The WS-Execution component 
consists of a commitment manager, a capability manager, a BPEL4WS process 
controller, an active process, and a failure detector. The capability manager 
maps services to known service providers. The commitment manager schedules
the services assigned to the agent in an appropriate order.

An agent ultimately needs to delegate those contracted services to appropriate
service providers. The process controller generates a BPEL4WS process
based on the WSDL of the selected service providers and the sequence indicated
in the PrT process. The failure detector identifies execution failure by
checking the termination conditions associated with services. If a termination
condition has been reached, the failure detector throws an error and the plan
adjustor module is invoked. If it is a service failure, the plan adjustor simply
asks the agent to choose another service provider and re-attempt the service; if
it is a process failure (unexpected changes make the process unworkable), the
plan adjustor back-tracks the PrT process, tries to find another (sub-)process
that would satisfy the task, and uses it to fix the one that failed.

Fig. 3. The relations between generated team process and other modules
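
The two failure responses described above can be sketched as follows. The exception classes, the `invoke` callable, and the list of alternative processes are hypothetical names introduced for this sketch; falling back to an alternative process once all providers have failed is also an assumption, since the text does not say what happens when no provider succeeds.

```python
class ServiceFailure(Exception):
    """A provider did not perform the service (assumed failure type)."""


class ProcessFailure(Exception):
    """The process itself has become unworkable (assumed failure type)."""


def run_with_adjustment(task, providers, invoke, alternative_processes):
    """Try providers in turn; on process failure, fall back to another (sub-)process.

    invoke(task, provider) performs the service or raises one of the failures above.
    alternative_processes is a list of zero-argument callables that satisfy the task.
    """
    for provider in providers:
        try:
            return invoke(task, provider)
        except ServiceFailure:
            continue                      # choose another provider and re-attempt
        except ProcessFailure:
            break                         # retrying providers cannot help
    for process in alternative_processes:
        try:
            return process()              # repair by substituting another process
        except (ServiceFailure, ProcessFailure):
            continue
    raise ProcessFailure(task)            # nothing worked: propagate upwards
```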



4.5 The Example Revisited 

Figure 3 shows how web service composition for VOSoft may be performed
with interleaved planning and execution. The figure shows the core (hierarchical)
Petri net representation used by the CAST architecture, and the manner in
which each of the modules in the architecture uses this representation. Due to the
dynamic nature of the process, it is not feasible to show all possible paths that
the execution may take. Instead, we show one plausible path, indicating the
responsibilities of each of the modules in the architecture, such as planning, team
formation, undertaking execution, sensing changes in the internal or external
environment (that may lead to exceptions), and proactive information sharing, and how
these allow the process to adapt to changes in the environment (proactive
exception handling). The result is an interleaved process that includes planning
and execution. The figure shows the mapping to elements of the web service
technology stack (e.g., the BPEL4WS specification), which allows use of the proposed
architecture with current proposals from W3C.



5 Discussion 

As business processes, specified as workflows and executed with web services,
need to be adaptive and flexible, approaches are needed to facilitate this evolution.
The methodology and architecture we have outlined address this concern
by pushing the burden of ensuring this flexibility to the web services participating
in the process. To achieve this, we have adapted and extended research in
the area of team-based agents. A key consequence of this choice is that our approach
allows interleaving of execution with planning, providing several distinct
advantages over current web service composition approaches in facilitating adaptive
workflows. First, it supports an adaptive process that is suitable for the highly
dynamic and distributed manner in which web services are deployed and used.
The specification of a joint goal allows each team member to contribute relevant
information to the composer agent, who can make decisions at critical choice
points. Second, it yields a hierarchical methodology for process management
where a service composer can compose a process at a coarse level appropriate to
its capability and knowledge, leaving further decomposition to competent teammates.
Third, it interleaves planning with execution, providing a natural vehicle
for implementing adaptive workflows.

Our work in this direction so far has provided us with the fundamental insight 
that further progress in effective and efficient web service composition can be 
made by better understanding how distributed and partial knowledge about the 
availability and capabilities of web services, and the environment in which they 
are expected to operate, can be shared among the team of agents that must 
collaborate to perform the composed web service. Our current work involves 
extending the ideas to address these opportunities and concerns, and reflecting 
the outcomes in the ongoing implementation.

Abstract. Web services promise to become a key enabling technology for B2B
e-commerce. Several languages have been proposed to compose Web services 
into workflows. The QoS of the Web services-based workflows may play an es- 
sential role in choosing constituent Web services and determining service level 
agreement with their users. In this paper, we identify a set of QoS metrics in the 
context of Web services and propose a unified probabilistic model for describ- 
ing QoS values of (atomic/composite) Web services. In our model, each QoS 
measure of a Web service is regarded as a discrete random variable with prob- 
ability mass function (PMF). We describe a computation framework to derive 
QoS values of a Web services-based workflow. Two algorithms are proposed to 
reduce the sample space size when combining PMFs. The experimental results 
show that our computation framework is efficient and results in PMFs that are 
very close to the real model. 



1 Introduction 

Web services have become a de facto standard for achieving interoperability among 
business applications over the Internet. In a nutshell, a Web service can be regarded 
as an abstract data type that comprises a set of operations and data (or message 
types). Requests to and responses from Web service operations are transmitted 
through SOAP (Simple Object Access Protocol), which provides XML-based message
delivery over an HTTP connection. The existing SOAP protocol uses synchronous
RPC for invoking operations in Web services. However, in response to an increasing
need to facilitate long-running activities, new proposals have been made to
extend SOAP to allow asynchronous message exchange (i.e., requests and responses
are not synchronous). One notable proposal is ASAP (Asynchronous Service Access
Protocol) [1], which allows the execution of long-running Web service operations,
and also non-blocking Web services invocation, in a less reliable environment (e.g.,
wireless networks). In the following discussion, we use the term Web service to refer
to an atomic activity, which may encompass either a single Web service operation (in
the case of asynchronous Web services) or a pair of invoke/respond operations (in the
case of synchronous Web services), and the term WS-workflow to refer to a workflow
composed of a set of Web service invocations threaded into a directed graph.

Several languages have been proposed to compose Web services into workflows. 
Notable examples include WSFL (Web Service Flow Language) [13] and XLANG 
(Web Services for Business Process Design) [16]. The ideas of WSFL and XLANG
have converged and been superseded by the BPEL4WS (Business Process Execution
Language for Web Services) specification [2]. Such Web services-based workflows
may subsequently become (composite) Web services, thereby enabling nested Web 
Services Workflows (WS-workflows). 

While the syntactic description of Web services can be specified through WSDL 
(Web Service Description Language), their semantics and quality of service (QoS) are 
left unspecified. The concept of QoS has been introduced and extensively studied in 
computer networks, multimedia systems, and real-time systems. QoS was mainly 
considered as an overload management problem that measures non-functional aspects 
of the target system, such as timeliness (e.g., message delay ratio) and completeness 
(e.g., message drop percentage). More recently, the concept of QoS is finding its way 
into application specification, especially in describing the level of service provided by 
a server. Typical QoS metrics at the application level include throughput, response 
time, cost, reliability, fidelity, etc. [12]. Some work has been devoted to the
specification and estimation of workflow QoS [3, 7]. However, previous work in
workflow QoS estimation either focused on the static case (e.g., computing the
average or the worst case QoS values) or relied on simulation to compute workflow
QoS in a broader context. While the former has limited applicability, the latter
requires substantial computation before reaching stable results. In this paper, we
propose a probability-based QoS model on Web services and WS-workflows that allows
for efficient and accurate QoS estimation. Such an estimation serves as the basis
for dealing with the Web services selection problem [11] and the service level
agreement (SLA) specification problem [6].

The main contributions of our research are: 

1. We identify a set of QoS metrics tailored for Web services and WS-workflows 
and give an anatomy of these metrics. 

2. We propose a probability-based WS-workflow QoS model and its computation 
framework. This computation framework can be used to compute QoS of a com- 
plete or partial WS-workflow. 

3. We explore alternative algorithms for computing probability distribution functions 
of WS-workflow QoS. The efficiency and accuracy of these algorithms are com- 
pared. 

This paper is organized as follows. In Section 2 we define the QoS model in the 
context of WS-workflows. In Section 3 we present the QoS computation framework 
for WS-workflows. In Section 4 we describe algorithms for efficiently computing the
QoS values of a WS-workflow. Section 5 presents preliminary results of our
performance evaluation. Section 6 reviews related work. Finally, Section 7 concludes
this paper and identifies directions for future research.

2 QoS Model for Web Services 

2.1 Web Services QoS Metrics 

Many workflow-related QoS metrics have been proposed in the literature [3, 7, 9, 11, 
12]. Typical categories of QoS metrics include performance (e.g., response time and 
throughput), resources (e.g., cost, memory/cpu/bandwidth consumption), dependabil- 
ity (e.g., reliability, availability, and time to repair), fidelity, transactional properties 
(e.g., ACID properties and commit protocols), and security (e.g., confidentiality, non- 
repudiation, and encryption). 

Some of the proposed metrics are related to the system capacity for executing a 
WS-workflow. For example, metrics used to measure the power of servers, such as
throughput, memory/cpu/bandwidth consumption, time to repair (TTR), and
availability, fall in the category called system-level QoS. However, the capacities
of servers for executing Web services (e.g., manpower for manual activities and
computing power for automatic activities) are unlikely to be revealed due to
autonomy considerations, and may change over time without notification. These
metrics might be useful in some workflow contexts such as intra-organizational
workflows (for determining the amount of resources to spend on executing
workflows). For inter-organizational workflows, where a needed Web service may be
controlled by another organization, QoS metrics in this category generally cannot
be measured, and are thus excluded from further discussion. Other QoS metrics
require all instances of the same Web service to share the same values. In this
case, it is better to view these metrics as service classes rather than quality of
service. Metrics of service class include those categorized as transactional
properties and security. In this paper we focus on those WS-workflow QoS metrics
that measure a WS-workflow instance and whose value may change across instances.
These metrics, called instance-level QoS metrics, include response time, cost,
reliability, and fidelity rating. Note that cost is a complicated metric and could
be a function of service class and/or other QoS values. For example, a Web service
instance that imposes weaker security requirements or incurs longer execution time
might be entitled to lower cost. Some services
may adopt a different pricing scheme that charges based on factors other than usage 
(e.g., membership fee or monthly fee). In this paper, we consider the pay-per-service 
pricing scheme, which allows us to include cost as an instance-level QoS metric. 

In summary, our work considers four metrics: Response time (i.e., time elapsed 
from the submission of a request to the receiving of the response), Reliability (i.e., 
the probability that the service can be successfully completed), Fidelity (i.e., reputa- 
tion rating) and Cost (i.e., the amount of money paid for executing an activity), which 
can be equally applicable to both atomic Web services and WS-workflows (also 
called composite Web services). These QoS metrics are defined such that different
instances of the same Web service may have different QoS values.

2.2 Probabilistic Modeling of Web Services QoS

We use a probability model for describing Web service QoS. In particular, we use a
probability mass function (PMF) on a finite scalar domain as the QoS probability
model. In other words, each QoS metric of a Web service is viewed as a discrete 
random variable, and the PMF indicates the probability that the QoS metric assumes a 
particular value. For example, the fidelity F of an example Web service with five 
grades (1-5) may have the following PMF: 

$f_F(1) = 0.1,\ f_F(2) = 0.2,\ f_F(3) = 0.3,\ f_F(4) = 0.3,\ f_F(5) = 0.1$

Note that it is natural to describe Reliability, Fidelity rating and Cost as random vari- 
ables and to model them as PMFs with domains being {0 (fail), 1 (success)}, a set of 
distinct ratings, and a set of possible costs, respectively. However, it is less intuitive to
use a PMF for describing response time, whose domain is inherently continuous. By
viewing response time at a coarser granularity, it is possible to model response time 
as a discrete random variable. Specifically, we partition the range of response time 
into a finite sequence of sub-intervals and use a representative number (e.g., the 
mean) to indicate each sub-interval. For example, suppose that the probabilities of a 
Web service being completed in one day, two to four days, and five to seven days, are 
0.2, 0.6, and 0.2, respectively. The PMF of its response time X is represented as fol- 
lows: 

$f_X(1) = 0.2$, $f_X(3) = 0.6$ (3 is the mean of [2, 4]), and $f_X(6) = 0.2$ (6 is the mean of [5, 7])

As expected, finer granularity on response time will yield more accurate estimation 
with higher overhead in representation and computation. We explore these tradeoffs 
in our experiments. 
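
The two PMFs used as examples in this section can be written down directly. The following is a minimal sketch, assuming a PMF is stored as a Python dictionary from value to probability; the helper names `discretize` and `expected_value` are illustrative and do not come from the paper.

```python
def discretize(intervals):
    """Build a PMF from (low, high, probability) sub-intervals.

    Each sub-interval is represented by the mean of its end points,
    which is the representative value used in the example above.
    """
    pmf = {}
    for low, high, prob in intervals:
        representative = (low + high) / 2
        pmf[representative] = pmf.get(representative, 0.0) + prob
    if abs(sum(pmf.values()) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return pmf


def expected_value(pmf):
    """Mean of a discrete random variable given as {value: probability}."""
    return sum(value * prob for value, prob in pmf.items())


# Fidelity PMF over the five grades 1-5.
fidelity = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.3, 5: 0.1}

# Response time: one day, two to four days, five to seven days.
response_time = discretize([(1, 1, 0.2), (2, 4, 0.6), (5, 7, 0.2)])
```

Splitting the range into more sub-intervals only means passing a longer list to `discretize`, which makes the accuracy versus size tradeoff mentioned above easy to experiment with.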

2.3 WS-Workflow Composition 

For an atomic Web service, its QoS PMFs can be derived from its past records of 
invocations. For a newly developed WS-workflow that is composed of a set of atomic 
Web services, we need a way to determine its QoS PMFs. Different workflow com- 
position languages may provide different constructs for specifying the control-flow 
among constituent activities (e.g., see [14, 15] for a comparison of the expressive 
powers of various workflow and Web services composition languages). 
Kiepuszewski et al. [8] define a structured workflow model that consists of only four 
constructs: sequential, or-split/or-join, and-split/and-join, and loop, which allows for 
recursive construction of larger workflows. Although it has been shown that this 
structured workflow model is unable to model arbitrary workflows [8], it is neverthe- 
less powerful enough to describe many real-world workflows. In fact, there exist 
some commercial workflow systems that support only structured workflows, such as 
SAP R/3 and FileNet Visual WorkFlo.

In this paper, as an initial step of the study, we focus our attention on structured 
workflows. To distinguish between exclusive or and (multiple choice) or, which is
crucial in deriving WS-workflow QoS, we extend the structured workflow model to
include five constructs:

1. sequential: a sequence of activities $(a_1, a_2, \ldots, a_n)$.

2. parallel (and split/and join): multiple activities $(a_1, a_2, \ldots, a_n)$ that
can be concurrently executed and merged with synchronization.

3. conditional (exclusive split/exclusive join): multiple activities
$(a_1, a_2, \ldots, a_n)$, among which only one activity can be executed.

4. fault-tolerant (and split/exclusive join): multiple activities
$(a_1, a_2, \ldots, a_n)$ that can be concurrently executed but merged without
synchronization.

5. loop: a block of activities a guarded by a condition “LC”. We adopt the while
loop in the following discussion.

3 Computing QoS Values of WS Compositions 

We now describe how to compute the WS-workflow QoS values for each composi- 
tion construct introduced earlier. We identify five basic operations for manipulating 
random variables, namely (i) addition, (ii) multiplication, (iii) maximum, (iv) mini- 
mum, and (v) conditional selection. Each of these operations takes as input a number 
of random variables characterized by PMFs and produces a random variable charac- 
terized by another PMF. The first four operations are quite straightforward, and their 
detailed descriptions are omitted here due to space limitations. For their formal
definitions, interested readers are referred to [5]. The conditional selection,
denoted $CS_{1 \le i \le n}(X_i, p_i)$, is defined as follows (see footnote 1). Let
$X_1, X_2, \ldots, X_n$ be n random variables, with $p_i$, $1 \le i \le n$, being the
probability that $X_i$ is selected by the conditional selection operation CS. Note
that the selection of any random variable is exclusive, i.e., exactly one of these
would be selected. The result of $CS_{1 \le i \le n}(X_i, p_i)$ is a new random
variable Z with $Dom(Z) = \bigcup_{1 \le i \le n} Dom(X_i)$. Specifically, the PMF
$f_Z$ of Z is as follows:

$f_Z(z) = \sum_{j:\, z \in Dom(X_j)} p_j \cdot f_{X_j}(z), \quad z \in Dom(Z).$

For each activity a, we consider four QoS metrics, namely response time, cost,
reliability, and fidelity, denoted T(a), C(a), R(a), and F(a) respectively (see
footnote 2). A WS-workflow composed of activities $a_1, a_2, \ldots, a_n$ using some
composition construct is denoted $w(a_1, a_2, \ldots, a_n)$. The QoS values of w,
under various composition constructs, are shown in Table 1.
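
To make the operations on random variables concrete, here is a minimal sketch of addition, maximum, and conditional selection on PMFs stored as dictionaries. It assumes the operand random variables are independent, which the code states explicitly; the function names `combine` and `conditional_selection` are illustrative only.

```python
from collections import defaultdict
from itertools import product


def combine(x, y, op):
    """PMF of op(X, Y), assuming X and Y are independent."""
    out = defaultdict(float)
    for (vx, px), (vy, py) in product(x.items(), y.items()):
        out[op(vx, vy)] += px * py
    return dict(out)


def conditional_selection(variables, probs):
    """PMF of CS(X_i, p_i): exactly one X_i is selected, with probability p_i."""
    out = defaultdict(float)
    for pmf, p in zip(variables, probs):
        for value, q in pmf.items():
            out[value] += p * q
    return dict(out)


x = {1: 0.5, 2: 0.5}
y = {2: 0.5, 4: 0.5}

seq = combine(x, y, lambda a, b: a + b)          # addition
par = combine(x, y, max)                         # maximum
cond = conditional_selection([x, y], [0.3, 0.7])  # conditional selection
```

Note that `cond` keeps the union of the two domains, {1, 2, 4}, whereas a weighted sum of the two variables would produce values outside that union.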

We assume that the fidelity of w using sequential or parallel composition is a
weighted sum of the fidelities of its constituent activities. The fidelity weight of
each activity can be either manually assigned by the designer, or automatically
derived from past history, e.g., by using linear regression. For the conditional
construct, exactly one activity will be selected at run-time. Thus, the fidelity of
w is the conditional selection of the fidelity of its constituent activities with
the associated probabilities. For the fault-tolerant construct, the fidelity of the
activity that is the first to complete becomes the fidelity of w. Thus,
$F(w) = CS_{1 \le i \le n}(F(a_i), p_f(a_i))$, where

$p_f(a_i) = \prod_{k \ne i} P(T(a_k) > T(a_i))$.

1 Take care not to confuse the conditional selection with the weighted sum
$\sum_i p_i \cdot X_i$. The weighted sum results in a random variable whose domain may
not be the union of the domains of the constituent activities. While the weighted
sum is used for computing the average value of a set of scalar values, it should not
be used to compute the PMF resulting from the conditional selection of a set of
random variables.

2 Note that each QoS metric of an activity is NOT a scalar value but a discrete random
variable characterized by a PMF.

A loop construct is defined as a repetition of a block guarded by a condition “LC”,
i.e., this block is repetitively executed until the condition “LC” no longer holds.
Cardoso et al. assumed a geometric distribution on the number of iterations [3].
However, the memoryless property of the geometric distribution fails to capture a
common phenomenon that a repeated execution of a block usually has a better chance
to exit the loop. Gillmann et al. [7] assumed the number of iterations to be
uniformly distributed, which again may not hold in many applications. In this paper,
rather than assuming a particular distribution, we simply regard the number of
iterations as a PMF with a finite scalar domain. Let $f_{L(a)}(l)$, $0 \le l \le c$,
be the PMF of the number of iterations of a loop structure L defined on a block a,
where c is the maximum number of iterations. Let T(a), C(a), R(a), F(a) denote the
PMFs of the response time, cost, reliability, and fidelity of a respectively. If a is
executed l times, the response time $T_a(l)$ is

$T_a(l) = \sum_{1 \le i \le l} T(a)$.

The response time of L is the conditional selection on $T_a(l)$ with probabilities
$f_{L(a)}(l)$, $0 \le l \le c$. Thus, the response time of L is

$T(L) = CS_{1 \le l \le c}(T_a(l), f_{L(a)}(l))$.

Similar arguments can be applied to the computation of cost and reliability.
Regarding fidelity, let $p_1$ be the probability of executing at least one iteration
and $p_0 = 1 - p_1$. When a is executed at least once, the fidelity of a loop
structure, in our view, is determined simply by its last execution of a. Let
$F_a(T)$ denote the fidelity when a is executed at least once (i.e.,
$F_a(T) = F(a)$) and $F_a(F)$ be the fidelity when a is not executed. The fidelity of
L is therefore computed as follows (with $p_T = p_1$ and $p_F = p_0$):

$F(L) = CS_{i \in \{F, T\}}(F_a(i), p_i)$.
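
The loop response time can be sketched in a few lines. This is an illustration under two assumptions that are not taken from the paper: successive iterations are independent, and zero iterations contribute a response time of zero. The names `add` and `loop_response_time` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def add(x, y):
    """PMF of X + Y, assuming X and Y are independent."""
    out = defaultdict(float)
    for (vx, px), (vy, py) in product(x.items(), y.items()):
        out[vx + vy] += px * py
    return dict(out)


def loop_response_time(t_a, iterations):
    """T(L): conditional selection over T_a(l), weighted by the iteration PMF."""
    result = defaultdict(float)
    t_l = {0: 1.0}  # assumption: zero iterations take zero time
    for l in range(max(iterations) + 1):
        if l > 0:
            t_l = add(t_l, t_a)  # T_a(l) = T_a(l-1) + T(a)
        p = iterations.get(l, 0.0)
        if p == 0.0:
            continue  # iteration counts with no probability mass are skipped
        for value, q in t_l.items():
            result[value] += p * q
    return dict(result)


t_a = {1: 0.5, 2: 0.5}   # response time of one execution of the block
f_l = {1: 0.5, 2: 0.5}   # the block runs once or twice with equal probability
t_loop = loop_response_time(t_a, f_l)
```

Cost follows the same pattern with cost PMFs in place of response times, and reliability replaces the sum by a product.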



4 Efficient Computation of WS-Workflow QoS 

4.1 High Level Algorithm 

A structured WS-workflow can be recursively constructed by using the five basic
constructs. Figure 1 shows an example WS-workflow, namely PC order fulfillment.
This WS-workflow is designed to tailor-make and to deliver personal computers at a
customer’s request. At the highest level, the WS-workflow is a sequential construct
that consists of Parts procurement, Assembly, Test, Adjustment, Shipping, and

Table 1. The QoS values of a WS-workflow w under various composition constructs



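| Composition construct | Cost: C(w) | Response time: T(w) | Reliability: R(w) | Fidelity: F(w) |
|---|---|---|---|---|
| Sequential | $\sum_{i=1}^{n} C(a_i)$ | $\sum_{i=1}^{n} T(a_i)$ | $\prod_{i=1}^{n} R(a_i)$ | $\sum_{i=1}^{n} w_i F(a_i)$ |
| Parallel | $\sum_{i=1}^{n} C(a_i)$ | $\max_i \{T(a_i)\}$ | $\prod_{i=1}^{n} R(a_i)$ | $\sum_{i=1}^{n} w_i F(a_i)$ |
| Conditional | $CS_{1 \le i \le n}(C(a_i), p_i)$ | $CS_{1 \le i \le n}(T(a_i), p_i)$ | $CS_{1 \le i \le n}(R(a_i), p_i)$ | $CS_{1 \le i \le n}(F(a_i), p_i)$ |
| Fault-tolerant | $\sum_{i=1}^{n} C(a_i)$ | $\min_i \{T(a_i)\}$ | $1 - \prod_{i=1}^{n} (1 - R(a_i))$, where $f_{1-R}(0) = f_R(1)$ and $f_{1-R}(1) = f_R(0)$ | $CS_{1 \le i \le n}(F(a_i), p_f(a_i))$, where $p_f(a_i) = \prod_{k \ne i} P(T(a_k) > T(a_i))$ |
| Loop | $CS_{1 \le l \le c}(C_a(l), f_{L(a)}(l))$, where $C_a(l) = \sum_{1 \le i \le l} C(a)$ | $CS_{1 \le l \le c}(T_a(l), f_{L(a)}(l))$, where $T_a(l) = \sum_{1 \le i \le l} T(a)$ | $CS_{1 \le l \le c}(R_a(l), f_{L(a)}(l))$, where $R_a(l) = \prod_{1 \le i \le l} R(a)$ | $CS_{i \in \{F, T\}}(F_a(i), p_i)$, where $F_a(T) = F(a)$ and $F_a(F)$ is the fidelity when a is not executed |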




Fig. 1. An example WS-workflow PC order fulfillment 

Customer notification. Parts procurement is a parallel construct that consists of
CPU procurement, HDD procurement, and CD-ROM procurement. CPU procurement in turn
is a conditional construct composed of Intel CPU procurement and AMD CPU
procurement. Adjustment is a loop construct on Fix&Test, which is iteratively
executed until the quality of the PC is ensured. Customer notification is a
fault-tolerant construct that consists of Email notification and Phone notification.
The success of either notification marks the completion of the entire WS-workflow.

ComputeQoS(A: a WS-workflow activity)
{
    IF A.type ≠ ATOMIC THEN {
        FOR (each activity t ∈ A.activities) DO
            ComputeQoS(t);
        IF A.construct = SEQUENTIAL THEN
            A.QoS = SequentialQoS(A.activities);
        ELSEIF A.construct = PARALLEL THEN
            A.QoS = ParallelQoS(A.activities);
        ELSEIF A.construct = CONDITIONAL THEN
            A.QoS = ConditionalQoS(A.activities);
        ELSEIF A.construct = FAULT_TOLERANT THEN
            A.QoS = FaultTolerantQoS(A.activities);
        ELSE // A.construct = LOOP
            A.QoS = LoopQoS(A.activities);
    }
    ELSE Estimate the QoSs of A and put them in A.QoS;
};

Fig. 2. Pseudo code for computing QoS of a WS-workflow

The QoS of the entire WS-workflow can be recursively computed. The pseudo-code is
listed in Figure 2. Note that SequentialQoS(A.activities), ParallelQoS(A.activities),
ConditionalQoS(A.activities), FaultTolerantQoS(A.activities), and
LoopQoS(A.activities) are used to compute the four QoS metric values for sequential,
parallel, conditional, fault-tolerant, and loop constructs respectively. Their pseudo
codes are quite clear from our discussion in Section 3 and omitted here for brevity.
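
As an illustration of the recursion, the sketch below computes only the response-time PMF of a nested workflow. It is not the authors' implementation: the `Activity` class, its field names, and the example PMFs are hypothetical, activities are assumed independent, and the fault-tolerant case simply takes the minimum of the branch times without considering branch failures.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from functools import reduce
from itertools import product


def combine(x, y, op):
    """PMF of op(X, Y), assuming X and Y are independent."""
    out = defaultdict(float)
    for (vx, px), (vy, py) in product(x.items(), y.items()):
        out[op(vx, vy)] += px * py
    return dict(out)


def select(pmfs, probs):
    """Conditional selection: mixture of the PMFs with the given weights."""
    out = defaultdict(float)
    for pmf, p in zip(pmfs, probs):
        if p == 0.0:
            continue
        for value, q in pmf.items():
            out[value] += p * q
    return dict(out)


def plus(u, v):
    return u + v


@dataclass
class Activity:
    construct: str = "ATOMIC"
    children: list = field(default_factory=list)
    time: dict = None          # response-time PMF (given for atomic activities)
    branch_probs: list = None  # CONDITIONAL only
    iterations: dict = None    # LOOP only: PMF of the number of iterations


def compute_time(a):
    """Recursively compute and store the response-time PMF of activity a."""
    if a.construct == "ATOMIC":
        return a.time
    times = [compute_time(child) for child in a.children]
    if a.construct == "SEQUENTIAL":
        a.time = reduce(lambda x, y: combine(x, y, plus), times)
    elif a.construct == "PARALLEL":
        a.time = reduce(lambda x, y: combine(x, y, max), times)
    elif a.construct == "FAULT_TOLERANT":
        a.time = reduce(lambda x, y: combine(x, y, min), times)
    elif a.construct == "CONDITIONAL":
        a.time = select(times, a.branch_probs)
    elif a.construct == "LOOP":
        total, parts, probs = {0: 1.0}, [], []  # zero iterations: zero time
        for l in range(max(a.iterations) + 1):
            if l > 0:
                total = combine(total, times[0], plus)
            parts.append(total)
            probs.append(a.iterations.get(l, 0.0))
        a.time = select(parts, probs)
    else:
        raise ValueError(f"unknown construct: {a.construct}")
    return a.time


# A reduced, made-up version of the PC order fulfillment example.
cpu = Activity("CONDITIONAL", [Activity(time={2: 1.0}), Activity(time={3: 1.0})],
               branch_probs=[0.5, 0.5])
parts = Activity("PARALLEL", [cpu, Activity(time={1: 0.5, 4: 0.5})])
notify = Activity("FAULT_TOLERANT",
                  [Activity(time={1: 0.5, 3: 0.5}), Activity(time={2: 1.0})])
order = Activity("SEQUENTIAL", [parts, notify])
order_time = compute_time(order)
```

The other three metrics would follow the same traversal with the per-construct formulas of Table 1 substituted for the response-time ones.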

4.2 Sample Space Reduction 

When combining PMFs of discrete random variables with respect to a given operation,
the sample space size of the resultant random variable may become huge. Consider
adding k discrete random variables each having n elements in their respective
domains. The sample space size of the resultant random variable, in the worst case, is
of the order of $n^k$. In order to keep the domain of a PMF after each operation at a
reasonable size, we propose to group the elements in the sample space. In other 
words, several consecutive scalar values in the sample space will be represented by a 
single value and the aggregated probability is computed. The problem is formally 
described below. 

Let the domain of X be $\{x_1, x_2, \ldots, x_s\}$, where $x_i < x_{i+1}$,
$1 \le i < s$, and the PMF of X be $f_X$. We call another random variable Y an
aggregate random variable of X if there exists a partition
$(j_0, j_1, j_2, \ldots, j_m)$ of $(x_1, x_2, \ldots, x_s)$, where
$1 = j_0 < j_1 < j_2 < \cdots < j_m = s+1$, such that the domain of Y is
$\{y_1, y_2, \ldots, y_m\}$, where $y_r$ represents the group
$x_{j_{r-1}}, \ldots, x_{j_r - 1}$, and the PMF for Y is

$f_Y(y_r) = \sum_{k=j_{r-1}}^{j_r - 1} f_X(x_k), \quad 1 \le r \le m.$

The aggregate error of Y with respect to X, denoted aggregate_error(Y, X), is the
mean square error defined as follows:

$aggregate\_error(Y, X) = \sum_{r=1}^{m} \sum_{k=j_{r-1}}^{j_r - 1} f_X(x_k) \cdot (x_k - y_r)^2.$



Aggregate Random Variable Problem 

Given a random variable X of domain size s and a desired domain size m, the problem
is to find an aggregate random variable Y of domain size m such that its aggregate
error with respect to X is minimized.

Dynamic Programming Method 

An optimal solution to this problem can be obtained by formulating it as a dynamic
program. Let $e(i, j, k)$ be the optimal aggregate error of partitioning
$x_i, x_{i+1}, \ldots, x_j$ into k subsequences. We have the following recurrence:

$e(i, j, k) = \min_{i \le a < j,\ 1 \le b < k} \big( e(i, a, b) + e(a+1, j, k-b) \big)$ if $j - i + 1 > k$ and $k > 1$
$e(i, j, k) = 0$ if $j - i + 1 = k$
$e(i, j, 1) = error(i, j)$,

where error(i, j) is the aggregated error introduced in representing
$\{x_i, x_{i+1}, \ldots, x_j\}$ by a single value. Specifically,

$error(i, j) = \sum_{i \le k \le j} f_X(x_k) \cdot (x_k - \bar{x})^2$, where
$\bar{x} = \frac{\sum_{i \le k \le j} f_X(x_k) \cdot x_k}{\sum_{i \le k \le j} f_X(x_k)}$.

The time complexity of the dynamic programming algorithm is $O(s^3 m^2)$, and its
space complexity is $O(s^2 m)$, the size of the table $e(i, j, k)$.
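
The recurrence can be transcribed almost literally with memoization. The sketch below returns only the optimal aggregate error, not the partition itself; the function name `optimal_aggregate_error`, the 0-based indexing, and the explicit feasibility check on each split are choices made for this illustration. It assumes the values are sorted and every probability is positive.

```python
from functools import lru_cache


def optimal_aggregate_error(values, probs, m):
    """Least aggregate error of grouping sorted values into m consecutive groups."""
    s = len(values)
    if not 1 <= m <= s:
        raise ValueError("m must be between 1 and the domain size")

    def error(i, j):
        # Error of representing x_i..x_j (inclusive, 0-based) by their weighted mean.
        xs, ps = values[i:j + 1], probs[i:j + 1]
        mean = sum(p * x for x, p in zip(xs, ps)) / sum(ps)
        return sum(p * (x - mean) ** 2 for x, p in zip(xs, ps))

    @lru_cache(maxsize=None)
    def e(i, j, k):
        if j - i + 1 == k:      # one value per group: no error
            return 0.0
        if k == 1:              # a single group covering x_i..x_j
            return error(i, j)
        return min(
            e(i, a, b) + e(a + 1, j, k - b)
            for a in range(i, j)
            for b in range(1, k)
            # each side must hold at least as many values as groups
            if a - i + 1 >= b and j - a >= k - b
        )

    return e(0, s - 1, m)


values = [1, 2, 10, 11]
probs = [0.25, 0.25, 0.25, 0.25]
best_two_groups = optimal_aggregate_error(values, probs, 2)
```

For this input the best two-group partition is {1, 2} and {10, 11}, each represented by its mean.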

Greedy Method 

To reduce the computation overhead, we propose a heuristic method for solving this
problem. The idea is to continuously merge the adjacent pair of samples that gives the
least error until a reasonable sample space size is reached. When an adjacent pair
$(x_i, x_{i+1})$ is merged, a new element $x'$ is created to replace $x_i$ and
$x_{i+1}$, where

$x' = \frac{f_X(x_i) \cdot x_i + f_X(x_{i+1}) \cdot x_{i+1}}{f_X(x_i) + f_X(x_{i+1})}$

and $f_X(x') = f_X(x_i) + f_X(x_{i+1})$.

The error of merging $(x_i, x_{i+1})$, denoted $pair\_error(x_i, x_{i+1})$, is
computed as follows:

$pair\_error(x_i, x_{i+1}) = f_X(x_i) \cdot (x_i - x')^2 + f_X(x_{i+1}) \cdot (x_{i+1} - x')^2$.

We can use a priority queue to store the errors of merging adjacent pairs. In each
iteration, we perform the following steps:

1. Extract an adjacent pair with the least pair_error() value from the priority
queue, say $(x_i, x_{i+1})$.

2. Replace $x_i$ and $x_{i+1}$ by the new value $x'$ in the domain of X.

3. Compute $pair\_error(x_{i-1}, x')$ if $i > 1$ and $pair\_error(x', x_{i+2})$ if
$i < s-1$. Delete $pair\_error(x_{i-1}, x_i)$ and $pair\_error(x_{i+1}, x_{i+2})$
from the priority queue, and insert $pair\_error(x_{i-1}, x')$ and
$pair\_error(x', x_{i+2})$ into the priority queue.

In each iteration, steps 1 and 3 take $O(\lg s)$ time while step 2 takes only
constant time. The total number of iterations is s-m. Thus, the time complexity of
this greedy approach is $O(s \lg s)$.
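
The merging rule can be sketched as follows. For brevity this illustration finds the cheapest adjacent pair by a linear scan in every iteration instead of maintaining the priority queue described above, so it runs in quadratic time rather than $O(s \lg s)$; the merge and error formulas are the ones given in the text. The function names are hypothetical.

```python
def merge_pair(x1, p1, x2, p2):
    """Merged value x', its probability, and the pair_error of the merge."""
    p = p1 + p2
    x = (p1 * x1 + p2 * x2) / p
    err = p1 * (x1 - x) ** 2 + p2 * (x2 - x) ** 2
    return x, p, err


def greedy_aggregate(values, probs, m):
    """Repeatedly merge the adjacent pair with the least pair_error down to size m."""
    if m < 1:
        raise ValueError("m must be at least 1")
    xs, ps = list(values), list(probs)
    while len(xs) > m:
        # Linear scan in place of the priority queue (ties go to the leftmost pair).
        i = min(
            range(len(xs) - 1),
            key=lambda k: merge_pair(xs[k], ps[k], xs[k + 1], ps[k + 1])[2],
        )
        x, p, _ = merge_pair(xs[i], ps[i], xs[i + 1], ps[i + 1])
        xs[i:i + 2] = [x]
        ps[i:i + 2] = [p]
    return xs, ps


reduced_values, reduced_probs = greedy_aggregate(
    [1, 2, 10, 11], [0.25, 0.25, 0.25, 0.25], 2
)
```

On this small input the greedy result coincides with the optimal two-group partition; in general it is only a heuristic.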

When it comes to combining the QoS values of n activities, we can perform pair-wise
QoS combinations n-1 times and apply a sample space reduction method as
described above after each combination. Suppose each random variable has the same
domain size m. The addition of two random variables may result in a new random
variable with domain size up to $m^2$. We then apply a sample space reduction
technique to reduce the domain size down to m before combining the next random
variable. Although there exist a large number of possible orders in which one can
combine the n activities, we have concluded from our preliminary experiments that
different orders of activity combination have little effect on the resultant
aggregate errors. Therefore, we simply choose an arbitrary order for pair-wise
combinations.

5 Performance Evaluation 

In this section we report the results of our initial evaluation of the proposed
probabilistic QoS computation framework. Due to space limitations, we only show the
experimental results on one QoS metric, namely the response time. Other metrics
exhibit similar trends.

The first set of experiments aims to show how the two proposed sample space size
reduction techniques, namely dynamic programming and the greedy method, impact
the accuracy of the resultant PMFs. Consider a simple workflow that consists of only
two sequential activities. Let Z be the response time of the workflow, and X and Y be
the response times of the two activities. Obviously Z=X+Y. For ease of comparison,
we assumed that the generative models for X and Y are both normal distributions,
which allows us to theoretically compute the generative model of Z. The performance
metric considered is the cumulative distribution function (CDF). The CDF of Z,
denoted as $F_Z(z)$, is defined as $\Pr(Z \le z)$. We exercised 100 values for z
ranging from 100 to 270. Figure 3(a) shows the CDFs obtained by using dynamic
programming and the greedy method, with the theoretical method, which is directly
computed from the generative model of Z, serving as the benchmark. As can be seen,
the CDFs obtained from dynamic programming and greedy methods are very close to the
theoretical CDF. To show the subtle differences, we plotted the CDF differences of
dynamic programming and greedy methods obtained by subtracting the theoretical CDF
from both CDFs, which are shown in Figure 3(b). Although the differences are indeed
very small (less than 0.01), it can still be seen that dynamic programming in
general leads to a better CDF, with a mean probability error of 0.001494, compared
to 0.002136 for the greedy method.

Fig. 3. (a) CDFs of the dynamic programming and the greedy method with the theoretical
method serving as the benchmark; (b) the CDF difference of each method from the theoretical
method

Fig. 4. Mean square errors of dynamic programming and greedy methods over different sample
space sizes

We next vary the domain size of the aggregate random variable Z and compare the 
mean square errors incurred by both dynamic programming and greedy methods. 
Figure 4 shows the experimental results. As expected, a larger domain size leads to a 
smaller mean square error. Besides, the greedy method consistently incurs slightly
higher mean square errors over different domain sizes.

We finally apply the proposed framework to compute the response time of the example
WS-workflow shown in Section 4.1, namely PC order fulfillment. The generative model
of each atomic activity is assumed to be the normal distribution with minimum and
maximum value constraints. To evaluate the PMF of the example WS-workflow computed
by our framework, we ran a simulation and generated 1 million instances. Thus, we
were able to get a stable PMF of the total response time that is very close to the
real one. We then used the PMF generated by simulation as the benchmark. Figure 5
shows the difference between the cumulative probabilities computed using the greedy
method and the benchmark, when the aggregate sample space size is set to 30. It can
be seen that the difference is very small, i.e., no higher than 0.008. In contrast,
the running time difference is substantial. Our experimental platform was a PC
server equipped with an Intel P4 2.66 GHz CPU and 512 MB of main memory. The
simulation program took about 3.5 seconds to generate one million instances, while
the greedy-method-based approach took only about 3 ms to compute the response time
PMF of the example WS-workflow.

Fig. 5. Cumulative probability error of response time of PC order fulfillment (aggregate sample
space size = 30)

6 Related Work 

QoS-based Web service selection has attracted a lot of attention in recent years. 
While previous work attempted to optimize the selection for a single activity, recent 
work has focused on the selection of Web services in order to satisfy the QoS re- 
quirements of a WS-workflow. Patel et al. [11] proposed an approach to select a 
number of Web services for a given activity, and to distribute activity instances to 
these Web services according to their QoS values. Menasce [10] proposed a scheme 
to estimate the throughput of a composite Web service from those of its constituent 
Web services, which serves as a basis for selecting Web services. However, the deri- 
vation of various QoS values of a composite Web service was not a focus of these 
efforts. Zeng et al. [17] identified a set of QoS metrics and proposed to apply linear 
programming to select an execution plan that has the optimal QoS value. However, 
the workflow constructs considered in their work are limited and do not include loop 
and fault-tolerance. Besides, only deterministic QoS values are computed. 

Our work is closest to the Stochastic workflow reduction technique [3], which it- 
eratively applies a set of reduction rules until a single workflow is reached. Further- 
more, Cardoso [3] identified the same four QoS measures as used in our work. How- 
ever, the mathematical approach proposed there can only be used to compute 
deterministic QoS values (e.g., average response time), while our probability-based 
QoS model enables the estimation of a probability distribution function for a given 
workflow, which allows broader applicability. 

Finally, several previous efforts have proposed using simulation to measure the
QoS of a workflow [3, 6, 7], with a major focus on timeliness measures. The
simulation approach models a workflow as a queuing system with transition
probabilities associated with conditional branches. In addition to its high
computation overhead, this approach requires the specification of server capacities.
In other words, the processing time and number of servers associated with each
activity have to be specified. While such a requirement causes little problem for an
(intra-organizational) workflow, it is impossible to do so for a WS-workflow that
spans the boundaries of several autonomous enterprises.

CPM (Critical Path Method) and PERT (Program Evaluation and Review Technique)
are project management techniques for project planning, scheduling, and control [4].
The key idea of CPM/PERT is to identify the critical path, which is the longest path
through the activity network controlling the entire project. In CPM, the time of each
activity is deterministic. PERT assumes a Beta distribution and independence of the
time of each activity. It uses three parameters, namely the Most Optimistic, Most
Likely, and Most Pessimistic estimates, to determine the mean value $\mu$ and standard
deviation $\sigma$ for the execution time of an activity using the following formulae:

$\mu$ = (Most Optimistic + (4 x Most Likely) + Most Pessimistic)/6
$\sigma$ = (Most Pessimistic - Most Optimistic)/6
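
As a quick worked example of the two PERT formulae, the sketch below evaluates them for one activity; the function name `pert_estimate` and the three input estimates are made up for illustration.

```python
def pert_estimate(optimistic, likely, pessimistic):
    """Mean and standard deviation of an activity time under the PERT formulae."""
    mean = (optimistic + 4 * likely + pessimistic) / 6
    std_dev = (pessimistic - optimistic) / 6
    return mean, std_dev


# Hypothetical estimates in days: optimistic 2, most likely 5, pessimistic 14.
mean, std_dev = pert_estimate(2, 5, 14)
```

Note that PERT summarizes each activity by two scalars, whereas the model in this paper carries the whole PMF through the composition.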

Compared to the structure of business processes, the network of CPM/PERT is much 
simpler and does not contain conditional and loop structures. It is more suitable for 
describing manufacturing processes, whose component activities are usually determi- 
nistic. CPM/PERT is focused on the computation of process time and the tradeoffs 
between the time and cost, while in WS-workflows, other QoS metrics, such as fidel- 
ity and reliability, need to be supported. 



7 Conclusions 

In this paper, we have identified a number of QoS metrics for Web services and
proposed a probability-based QoS model. A QoS metric of an (atomic or composite) Web
service is described as a probability mass function. We have described algorithms to
compute the QoS values of a WS-workflow from those of its constituent Web services.
We introduced the problem of computing the least error PMF of a composite
WS-workflow, and showed that the search space is very large. We provided a dynamic
programming formulation for the optimal solution, and an efficient approximation
heuristic for it. Preliminary experimental results show that our proposed algorithm
achieves high accuracy and is computationally efficient.

The proposed model and framework can be used to estimate the QoS values of a 
WS-workflow at design time. However, we foresee the need for a WS-workflow
monitoring service that can alert the owner or users of a WS-workflow about the
possible violation of some QoS objective at an early stage. Such an alert function
serves as an early notification of potential violation of the service level agreement
(SLA) and allows the responsible entities to carry out compensatory activities.
Although our QoS computation framework has been shown to be efficient, further
optimization is needed in order to meet the high efficiency requirements of the
run-time monitoring service. We are currently investigating the issues and
techniques in implementing the monitoring service.

Abstract. Conditional schema changes change the schema of the tuples that satisfy
the change condition. When the schema of a relation changes, some tuples
may no longer fit the current schema. Handling the mismatch between the in- 
tended schema of tuples and the recorded schema of tuples is at the core of a 
DBMS that supports schema evolution. We propose to keep track of schema mis- 
matches at the level of individual tuples, and prove that evolving schemas with 
conditional schema changes, in contrast to database systems relying on data
migration, are lossless when the schema evolves. The lossless property is a
precondition for a flexible semantics that allows general queries over evolving
schemas to be answered correctly. The key challenge is to handle attribute
mismatches between
the intended and recorded schema in a consistent way. We provide a parametric 
approach to resolve mismatches according to the needs of the application. We 
introduce the mismatch extended completed schema (MECS) which records at- 
tributes along with their mismatches, and we prove that relations with MECS are 
lossless. 

1 Introduction 

Schema evolution occurs when the schema of a populated database is changed. After the
schema of a relation has evolved, some tuples no longer fit the schema. The mismatch
between the intended schema of a tuple and the recorded schema of the tuple, i.e., 
the schema used to record the tuple in the database, is inherent to evolving schemas. 
Handling this mismatch is at the very core of a DBMS that supports schema evolution. 

The paper considers conditional schema evolution. A conditional schema change is 
an operation that changes the schema of the tuples that satisfy the condition. Conditional 
schema changes properly subsume regular (i.e., unconditional) schema changes, since it 
is always possible to have the condition select the entire extent of a relation. The main 
difference is that conditional schema evolution results in several current and equally 
important schemas. 

As a first step towards a foundation for conditionally evolving database schemata,
we present a theoretical framework for conditional evolution at the level of the relational
model. The framework is based on evolving schemas consisting of a set of schema
segments (akin to versions), where each segment defines the intended schema of a subset
of tuples. We show that, in contrast to current commercial database systems, evolving
schemas are lossless models. The lossless property ensures that schema changes can be
rolled back and that tuple updates and schema changes are orthogonal operations, i.e.,
we never have to resort to data migration to deal with schema mismatches.

© Springer-Verlag Berlin Heidelberg 2004

After the schema of a relation has evolved, the intended and recorded schemas of
some tuples are out of sync. The attribute mismatches between the intended and
recorded schemas of tuples have to be resolved systematically to get sensible answers to 
queries. We suggest a parametric approach that resolves attribute mismatches according 
to the needs of the application. 

We propose the mismatch extended completed schema (MECS) which records both 
attributes and their corresponding attribute mismatches. We prove that relations with 
MECS are lossless evolution models. One of the salient features of such relations is that 
schema changes can be dealt with as standard tuple updates. We introduce parametric 
mismatch resolution of relations with MECS and establish the upper bound on its time 
complexity to be proportional to the size of the relation. 

2 Preliminaries 

2.1 Evolving Schemas 

An evolving schema, $E = \{S_1, \ldots, S_n\}$, generalizes a relation schema and is defined
as a set of schema segments. A schema segment, $S = (\mathcal{A}, P)$, consists of a schema
$\mathcal{A}$ and a qualifier $P$. Throughout, we write $\mathcal{A}_S$ and $P_S$ to directly refer to the
schema and qualifier of segment $S$, respectively. As usual, a schema,
$\mathcal{A} = \{A_1, \ldots, A_n\}$, is defined as a set of attributes. For the purpose of this paper,
no distinction is made between schemas and sets of attributes. A qualifier $P$ is either
TRUE, FALSE, or a conjunction/disjunction of attribute constraints. An attribute constraint
is a predicate of the form $A \theta c$ or $\neg(A \theta c)$, where $A$ is an attribute,
$\theta \in \{<, \leq, =, \neq, \geq, >\}$ is a comparison predicate, and $c$ is a constant. An evolving
schema may have segments with different schemas. Consequently, some tuples may be
missing attributes that appear in other segments. In order to evaluate attribute constraints
on such tuples, $A \theta c$ is an abbreviation for $\exists v (A/v \in t \wedge v \theta c)$, where $t$ is a tuple.
Likewise, $\neg(A \theta c)$ is an abbreviation for $\neg\exists v (A/v \in t \wedge v \theta c)$. Note that this
implies that the constraints $\neg(A = c)$ and $A \neq c$ are not equivalent: a tuple without
attribute $A$ satisfies the former but not the latter.
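
The difference between the two constraints can be made concrete with a small sketch. It assumes that a tuple is represented as a Python dictionary from attribute names to values; the helper names `holds` and `not_holds` are illustrative and do not come from the paper.

```python
import operator

# Comparison predicates (theta) mapped to Python operators.
OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       "!=": operator.ne, ">=": operator.ge, ">": operator.gt}


def holds(t, attr, op, const):
    """A theta c: some value v is recorded for attr in t and v theta c."""
    return attr in t and OPS[op](t[attr], const)


def not_holds(t, attr, op, const):
    """not (A theta c): no recorded value of attr in t satisfies v theta c."""
    return not holds(t, attr, op, const)


with_major = {"N": "tom", "M": "math"}
without_major = {"N": "tom"}  # no value recorded for M

# Both constraints agree when M is recorded ...
print(not_holds(with_major, "M", "=", "bio"), holds(with_major, "M", "!=", "bio"))
# ... but they differ when M is missing from the tuple.
print(not_holds(without_major, "M", "=", "bio"), holds(without_major, "M", "!=", "bio"))
```

For the tuple without a major, the negated equality is satisfied while the inequality is not, which is exactly the asymmetry noted above.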

A tuple $t$ is a set of attribute values where each attribute value is an attribute/value
pair: $\{A_1/v_1, \ldots, A_n/v_n\}$. The value must be an element of the domain of the
attribute, i.e., if $dom(A)$ denotes the domain of attribute $A$, then
$\forall A, v, t\,(A/v \in t \Rightarrow v \in dom(A))$. A tuple $t$ qualifies for a segment $S$,
$qual(t, S)$, iff $t$ satisfies the qualifier $P_S$. A tuple satisfies a qualifier, $P(t)$, iff the
qualifier is TRUE or the tuple makes the qualifier true under the standard interpretation.
If a tuple $t$ qualifies for a segment $S$ in an evolving schema $E$, then $\mathcal{A}_S$ is the
intended schema of $t$, i.e.,
$\forall t, S, E\,(S \in E \wedge qual(t, S) \Rightarrow is(t, E) = \mathcal{A}_S)$.
A tuple $t$ matches a segment $S$ iff the schemas of $S$ and $t$ are identical:
$match(t, S)$ iff $\forall A\,(A \in \mathcal{A}_S \Leftrightarrow \exists v\,(A/v \in t))$. If a tuple $t$
matches a segment $S$ in the evolving schema $E$, then $\mathcal{A}_S$ is the recorded schema of $t$,
i.e., $\forall t, S, E\,(S \in E \wedge match(t, S) \Rightarrow rs(t, E) = \mathcal{A}_S)$.

2.2 Conditional Schema Changes 

A conditional schema change is an operation that changes the set of segments of an
evolving schema. The condition determines the tuples that are affected by the schema
change. A condition $C$ is either TRUE, FALSE, an attribute constraint, or a conjunction
of attribute constraints. For the purpose of this paper we consider two conditional
schema changes: adding an attribute, $\alpha(A, E, C)$, and deleting an attribute, $\beta(A, E, C)$.
An extended set of schema changes that includes mappings between attributes and a
discussion of their completeness can be found elsewhere [9].

$\alpha(A, E, C)$: An attribute $A$ is added to the schemas of all segments that do not
already include the attribute. For each such segment two new segments are generated:
a segment with a schema that does not include the new attribute and a segment with a
schema that includes the new attribute. Segments with a schema that already includes
$A$ are not changed.

$\beta(A, E, C)$: The attribute $A$ is deleted from the schemas of all segments that include
the attribute. For each such segment two new segments are generated: a segment with
a schema that still includes the attribute and a segment with a schema that does not
include the attribute. Segments with a schema that does not include $A$ are not changed.

The precise formal definitions of conditional attribute additions and deletions are
given in Figure 1.



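\[
\alpha(A, \emptyset, C) = \emptyset
\]
\[
\alpha(A, \{(\mathcal{A}, P)\} \cup E, C) =
\begin{cases}
\{(\mathcal{A}, P)\} \cup \alpha(A, E, C) & \text{iff } A \in \mathcal{A} \\
\{(\mathcal{A} \cup \{A\}, P \wedge C), (\mathcal{A}, P \wedge \neg C)\} \cup \alpha(A, E, C) & \text{iff } A \notin \mathcal{A}
\end{cases}
\]
\[
\beta(A, \emptyset, C) = \emptyset
\]
\[
\beta(A, \{(\mathcal{A}, P)\} \cup E, C) =
\begin{cases}
\{(\mathcal{A}, P)\} \cup \beta(A, E, C) & \text{iff } A \notin \mathcal{A} \\
\{(\mathcal{A} \setminus \{A\}, P \wedge C), (\mathcal{A}, P \wedge \neg C)\} \cup \beta(A, E, C) & \text{iff } A \in \mathcal{A}
\end{cases}
\]

Fig. 1. Adding ($\alpha(A, E, C)$) and Deleting ($\beta(A, E, C)$) Attribute $A$ on Condition $C$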
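
The recursive definitions of Figure 1 can be sketched as set transformations. The following is a minimal illustration under several assumptions that are not part of the paper: a segment is a pair of frozensets (attributes, literals), a qualifier is a conjunction of literals with the empty conjunction standing for TRUE, a literal is a triple (attribute, constant, negated) restricted to equality, and the condition is a single attribute constraint. The `prune` helper, which drops segments with contradictory (FALSE) qualifiers, is likewise hypothetical. The example replays the Student schema changes of Section 2.3.

```python
def negate(literal):
    attr, const, negated = literal
    return (attr, const, not negated)


def add_attribute(attr, schema, cond):
    """alpha(A, E, C): split every segment whose schema lacks attr."""
    result = set()
    for attrs, qual in schema:
        if attr in attrs:
            result.add((attrs, qual))  # segment is left unchanged
        else:
            result.add((attrs | {attr}, qual | {cond}))
            result.add((attrs, qual | {negate(cond)}))
    return result


def delete_attribute(attr, schema, cond):
    """beta(A, E, C): split every segment whose schema includes attr."""
    result = set()
    for attrs, qual in schema:
        if attr not in attrs:
            result.add((attrs, qual))  # segment is left unchanged
        else:
            result.add((attrs - {attr}, qual | {cond}))
            result.add((attrs, qual | {negate(cond)}))
    return result


def prune(schema):
    """Drop segments whose qualifier contains a literal and its negation."""
    return {(attrs, qual) for attrs, qual in schema
            if not any(negate(lit) in qual for lit in qual)}


student = {(frozenset("NMLG"), frozenset())}  # one segment, qualifier TRUE
is_grad = ("L", "grad", False)
is_bio = ("M", "bio", False)

evolved = add_attribute("U", student, is_grad)
evolved = delete_attribute("G", evolved, is_bio)
evolved = prune(add_attribute("C", evolved, is_bio))

for attrs, qual in sorted(evolved, key=lambda s: sorted(s[0])):
    print("".join(sorted(attrs)), sorted(qual))
```

Without pruning, the last addition yields eight segments, four of which carry unsatisfiable qualifiers; after pruning, the four remaining segments have the schemas of the running example.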



2.3 Running Example 

Assume a schema that models students: Student = (Name, Major, Level, Grade). The schema
requires a name, major, level, and grade for each student. We consider two conditional
schema changes.

$\alpha(\mathit{sUpervisor}, \mathit{Student}, \mathit{Level} = \mathit{grad})$ (1)

$\beta(\mathit{Grade}, \mathit{Student}, \mathit{Major} = \mathit{bio})$, $\alpha(\mathit{Credits}, \mathit{Student}, \mathit{Major} = \mathit{bio})$ (2)

The first conditional schema change assigns a supervisor to graduate students. Therefore,
a sUpervisor attribute is added to the schema of graduate students: (Name, Major, Level,
Grade, sUpervisor). The schema (Name, Major, Level, Grade) remains valid for undergraduate
students. Note that we are left with two current and equally important schemas.

The second conditional schema change requires that a new credit system is used
for students with a major in biology. The credit system replaces the old grading system.
Thus, biology students need to be recorded with a Credits attribute instead of a Grade
attribute. Obviously, biology students may be enrolled as undergraduate or graduate
students. Therefore, the schema change applies to the schema of graduate and
undergraduate students. This yields an evolving schema with a total of four segments
(attributes are abbreviated by their capitalized letters N, M, L, G, C, and U):

$S_1 = (\{N, M, L, G\}, \neg(L = \mathit{grad}) \wedge \neg(M = \mathit{bio}))$

$S_2 = (\{N, M, L, G, U\}, L = \mathit{grad} \wedge \neg(M = \mathit{bio}))$

$S_3 = (\{N, M, L, C\}, \neg(L = \mathit{grad}) \wedge M = \mathit{bio})$

$S_4 = (\{N, M, L, C, U\}, L = \mathit{grad} \wedge M = \mathit{bio})$

An example instance of the evolving Student schema is illustrated in Figure 2. Each tuple
is shown as specified by the user, i.e., with values for the attributes of the recorded
schema. The intended schema (is) is shown to the right. Note that only tuples $t_3$ and $t_5$
match on their recorded and intended schema. For the other three tuples the recorded
and intended schema do not match. For example, tuple $t_1$ has the recorded schema
$rs = (N, M, L, G)$ and the intended schema $is = (N, M, L, C)$.

t1 = (N/john, M/bio,  L/ugrad, G/9.2)          is = (N, M, L, C)
t2 = (N/anne, M/math, L/grad,  G/8.7)          is = (N, M, L, G, U)
t3 = (N/tom,  M/math, L/ugrad, G/5.9)          is = (N, M, L, G)
t4 = (N/kim,  M/bio,  L/grad,  G/7.1, U/rick)  is = (N, M, L, C, U)
t5 = (N/rita, M/bio,  L/grad,  C/31,  U/mike)  is = (N, M, L, C, U)

Fig. 2. An Instance of the Student Schema
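
The intended schemas listed in Figure 2 can be recomputed from the four segments. The sketch below is illustrative only: the segments are hard-coded with their qualifiers written as Python predicates, tuples are dictionaries, and the recorded schema is simply taken to be the set of attributes present in the tuple. The names `SEGMENTS`, `intended_schema`, and `recorded_schema` are assumptions of this sketch.

```python
# Segments S1..S4 of the Student example: (schema, qualifier as a predicate).
# A missing attribute makes an equality test false, as in Section 2.1.
SEGMENTS = [
    (frozenset("NMLG"), lambda t: t.get("L") != "grad" and t.get("M") != "bio"),
    (frozenset("NMLGU"), lambda t: t.get("L") == "grad" and t.get("M") != "bio"),
    (frozenset("NMLC"), lambda t: t.get("L") != "grad" and t.get("M") == "bio"),
    (frozenset("NMLCU"), lambda t: t.get("L") == "grad" and t.get("M") == "bio"),
]

TUPLES = {
    "t1": {"N": "john", "M": "bio", "L": "ugrad", "G": 9.2},
    "t2": {"N": "anne", "M": "math", "L": "grad", "G": 8.7},
    "t3": {"N": "tom", "M": "math", "L": "ugrad", "G": 5.9},
    "t4": {"N": "kim", "M": "bio", "L": "grad", "G": 7.1, "U": "rick"},
    "t5": {"N": "rita", "M": "bio", "L": "grad", "C": 31, "U": "mike"},
}


def intended_schema(t):
    """Schema of the single segment whose qualifier t satisfies."""
    [schema] = [attrs for attrs, qualifies in SEGMENTS if qualifies(t)]
    return schema


def recorded_schema(t):
    """Attributes for which t actually stores a value."""
    return frozenset(t)


in_sync = [name for name, t in TUPLES.items()
           if intended_schema(t) == recorded_schema(t)]
print(in_sync)  # ['t3', 't5']
```

The list unpacking in `intended_schema` fails loudly if a tuple qualifies for zero or several segments, which mirrors the uniqueness of the intended schema.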



3 Lossless Schema Evolution 

This section develops a framework that can be used to decide whether an evolving 
database model is lossless. Intuitively, a model is lossless iff at each point it can be used 
to determine the intended schema of tuples. This also means that schema changes can 
be rolled back. We use the evolving schema from Section 2.1 as a yardstick. Essentially, 
a lossless model must be able to determine the intended schema with at least the same 
degree of precision as the evolving schema. 

A key issue with evolving schemas is that after several schema changes the schema
may no longer permit the intended schema to be determined correctly. To see this, assume
a model that preserves deleted attributes so it is possible to roll back to previous states
(this is a common technique that is also used by commercial database systems). In order
to replace grades with credits we drop the grade attribute and add a credit attribute. With
dropped attributes being preserved we end up with the schema (N, M, L, G, C). This is
clearly not the intended schema, and without extra information it is impossible to figure
out that the credit attribute is supposed to replace the grade attribute. Thus, the model
is lossy.

Evolving schemas preserve the qualification of tuples. Thus, tuples with an intended
schema are guaranteed to also have an intended schema after the schema has evolved.
Moreover, this intended schema is unique. The proofs have been omitted due to space
considerations, but can be found in full detail in [10].

Lemma 1. Let $E$ be an evolving schema, $t$ be a tuple, and $\gamma(A, E, C)$ be a conditional
schema change where $\gamma \in \{\alpha, \beta\}$. If $t$ qualifies for a segment in $E$, then $t$ also
qualifies for a segment in $\gamma(A, E, C)$, i.e.,
$\forall E, S, t, \gamma\,(S \in E \wedge qual(t, S) \Rightarrow \exists S'(S' \in \gamma(A, E, C) \wedge qual(t, S')))$.

Lemma 2. Let $E$ and $E'$ be evolving schemas, and $\gamma \in \{\alpha, \beta\}$ be a conditional schema
change such that $\gamma(A, E, C) = E'$. Let $t$ be a tuple that qualifies for a single segment
in $E$. Then the qualifying segment of $t$ in $E'$ is unique.

Lemmas 1 and 2 guarantee that the evolving schema always uniquely determines the 
intended schema of each tuple in a relation. This is required to accurately answer gen- 
eral queries over evolving relations (cf. Section 5). Moreover, for each tuple the evolv- 
ing schema provides an intended schema that is consistent with the schema changes 
applied, i.e., a conditional schema change only changes the intended schema of a tuple 
if the tuple satisfies the corresponding condition. 

We characterize an evolution model $M = (D, \Gamma, is)$ by a database schema $D$, a set
of schema change operations $\Gamma$, and a function $is : t \times D \rightarrow \mathcal{A}$ that, given a
tuple $t$ and a database schema $D$, determines the intended schema of $t$. Each schema change
operation $\gamma \in \Gamma$ is a function $\gamma : D \times C \rightarrow D$ that, given a database schema and a
condition, applies the conditional schema change to produce a new database schema. An
evolution model that associates the same intended schemas with tuples as the evolving
schema is lossless iff it continues to do so after a conditional schema change has been
applied.

Definition 1. (lossless) Let $E$ be an evolving schema and $M$ be an evolution model.
Let $t$ be a tuple and $\gamma(A, E, C)$ be a conditional schema change. $M$ is lossless iff

\[
\forall t, \gamma, C, E, M\, \big(
M = (D, \Gamma, is') \wedge \gamma \in \Gamma \wedge
is(t, E) = is'(t, D) \Rightarrow is(t, \gamma(A, E, C)) = is'(t, \gamma(A, D, C)) \big)
\]

Example 1. Consider the evolution model $M = (D, \Gamma, is)$ based on the completed
schema. The completed schema $D = \mathcal{A}$ contains all attributes $A$ ever introduced,
i.e., only attribute additions change the schema. The schema change operations $\Gamma$ on
the completed schema are therefore defined as follows: $\alpha(A, \mathcal{A}, C) = \mathcal{A} \cup \{A\}$ and
$\beta(A, \mathcal{A}, C) = \mathcal{A}$. A property of the completed schema is that the intended and actual
schema of tuples are synchronized, i.e., $is(t, \mathcal{A}) = \mathcal{A}$. Clearly, the completed schema
is not lossless, since attribute deletions do not change the intended schema of tuples as
required by the evolving schema.
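
The lossy behaviour of the completed schema in Example 1 can be replayed on the Student example. The sketch below is only an illustration: the function names are hypothetical, attributes are abbreviated by single letters, and the intended schema that the evolving schema assigns to John, (N, M, L, C), is copied from Figure 2 rather than computed.

```python
def completed_add(schema, attr, cond):
    """alpha on the completed schema: the attribute is always added."""
    return schema | {attr}


def completed_delete(schema, attr, cond):
    """beta on the completed schema: deletions leave the schema unchanged."""
    return schema


completed = frozenset("NMLG")
completed = completed_delete(completed, "G", "M = bio")
completed = completed_add(completed, "C", "M = bio")

# Intended schema of John (t1) according to the evolving schema, see Figure 2.
intended_t1 = frozenset("NMLC")

print("".join(sorted(completed)))  # CGLMN
print(completed == intended_t1)    # False
```

The completed schema still contains G after the deletion, so it disagrees with the evolving schema on John's intended schema.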

4 Attribute Mismatches 

This section investigates the mismatch between the recorded and intended schema of
tuples. We illustrate the four types of mismatches that may occur at the level of individual
attributes, and establish the relationship between conditional schema changes and
attribute mismatches.

A history $H = [\gamma_1(A_1, E, C_1), \ldots, \gamma_n(A_n, E, C_n)]$ is a sequence of conditional
schema changes where $\gamma_i \in \{\alpha, \beta\}$. Any schema can be constructed by adding each
attribute unconditionally. E.g.,
$\alpha(G, \alpha(L, \alpha(M, \alpha(N, E_0, \text{TRUE}), \text{TRUE}), \text{TRUE}), \text{TRUE})$
constructs the initial student schema, where $E_0 = \{(\{\}, \text{TRUE})\}$ is an evolving schema
with a single empty segment. We assume that segments with FALSE qualifiers are
removed, which will yield an evolving schema with a single segment having the intended
schema. It follows that any evolving schema can be constructed from a history by
first constructing the initial schema and then adding in sequence the same conditional
schema changes applied to the evolving schema. We write $E_H$ to denote the evolving
schema defined by $H$.



4.1 Mismatch Types

Let $H$ be a history and $\mathcal{A}_H$ be the schema containing all attributes added by schema
changes in $H$. Let $t$ be a tuple with recorded schema $\mathcal{A}_r$ and intended schema
$\mathcal{A}_i = is(t, E_H)$. Only attributes in $\mathcal{A}_H$ can be queried. Since
$\mathcal{A}_r \subseteq \mathcal{A}_H$ and $\mathcal{A}_i \subseteq \mathcal{A}_H$ (otherwise $t$ would not be a valid tuple
for $H$), an attribute $A \in \mathcal{A}_H$ causes one of four possible types of mismatches
depending on its membership of $\mathcal{A}_r$ and $\mathcal{A}_i$, respectively:

- No mismatch ($M_1: A \in \mathcal{A}_r \wedge A \in \mathcal{A}_i$) The attribute appears in both the
recorded and intended schema of a tuple. For example, for Kim (tuple $t_4$ in Figure 2)
there is no mismatch for attributes N, M, L, and U.

- Not recorded ($M_2: A \notin \mathcal{A}_r \wedge A \in \mathcal{A}_i$) The attribute appears only in the
intended schema of the tuple. These mismatches occur when schema changes add new
attributes. For example, John (tuple $t_1$ in Figure 2) was not recorded with a value for
the Credits attribute, which was added after the tuple was inserted into the database.

- Not available ($M_3: A \in \mathcal{A}_r \wedge A \notin \mathcal{A}_i$) A tuple is recorded with the attribute,
but the attribute does not appear in the intended schema of the tuple. Mismatches of this
kind are the result of attribute deletions. For example, John (tuple $t_1$ in Figure 2) is
recorded with a grade of 9.2, but according to the intended schema is not supposed
to have a grade attribute.

- Not applicable ($M_4: A \notin \mathcal{A}_r \wedge A \notin \mathcal{A}_i$) The attribute neither appears in the
recorded nor the intended schema of the tuple. Such mismatches occur, e.g., for
tuples that do not satisfy the condition for an attribute addition. For example, the
Credits and sUpervisor attributes are not applicable to Tom (tuple $t_3$).

Table 1 shows the attribute mismatches and attribute values (where available) for
each tuple in Figure 2. Note that the recorded and intended schema of a tuple can be
determined directly from the attribute mismatches of the tuple. The mismatch type of
each attribute determines whether that attribute appears in both schemas ($M_1$ type
mismatches), only in the intended schema ($M_2$ type mismatches), only in the recorded
schema ($M_3$ type mismatches), or in neither schema ($M_4$ type mismatches).
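
The four cases, and the observation that the mismatch types encode both schemas, can be sketched as follows. The representation (schemas as Python sets of single-letter attribute names, mismatch types as the integers 1 to 4) and the function names are assumptions of this sketch.

```python
ATTRS = "NMLGCU"


def mismatch(attr, recorded, intended):
    """Return i such that attr has mismatch type M_i."""
    if attr in recorded:
        return 1 if attr in intended else 3
    return 2 if attr in intended else 4


def schemas(codes):
    """Recover (recorded, intended) schema from a dict attr -> mismatch type."""
    recorded = {a for a, i in codes.items() if i in (1, 3)}
    intended = {a for a, i in codes.items() if i in (1, 2)}
    return recorded, intended


# John (t1 in Figure 2): recorded (N, M, L, G), intended (N, M, L, C).
recorded, intended = set("NMLG"), set("NMLC")
codes = {a: mismatch(a, recorded, intended) for a in ATTRS}
print(codes)
print(schemas(codes) == (recorded, intended))
```

Decoding the mismatch types returns exactly the two schemas that were used to compute them, so nothing is lost by storing only the mismatch types.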



Table 1. Attribute Mismatches and Attribute Values for the Evolving Student Schema

Name      Major     Level      Grade     Credits   sUpervisor
M1/john   M1/bio    M1/ugrad   M3/9.2    M2        M4
M1/anne   M1/math   M1/grad    M1/8.7    M4        M2
M1/tom    M1/math   M1/ugrad   M1/5.9    M4        M4
M1/kim    M1/bio    M1/grad    M3/7.1    M2        M1/rick
M1/rita   M1/bio    M1/grad    M4        M1/31     M1/mike

4.2 Mismatches and Conditional Schema Changes

Conditional schema changes change the set of attributes in the intended schema of
tuples. Because the intended schema can be determined from the set of attribute
mismatches, conditional schema changes also change the attribute mismatches of tuples.

The following two lemmas establish this relationship between attribute mismatches
and conditional schema changes.

Let $A$ be an attribute, $t$ be a tuple, $H$ be a history, and $M_i$ be an attribute mismatch
with $i \in \{1, \ldots, 4\}$. If the attribute mismatch between $t$ and $is(t, E_H)$ on attribute $A$
is $M_i$, then $m(A, t, E_H) = M_i$. Note that $m(A, t, \emptyset)$ is $M_3$ iff $\exists v (A/v \in t)$,
otherwise $M_4$, since the intended schema of $t$ is empty.

Lemma 3. Let $H$ be a history and $t$ be a valid tuple of $E_H$ with $m(A, t, E_H) = M_i$.
Let $\alpha(A, E_H, C)$ be a conditional attribute addition, then

\[
m(A, t, \alpha(A, E_H, C)) =
\begin{cases}
M_{i-2} & \text{iff } i \in \{3, 4\} \wedge C(t) \\
M_i & \text{otherwise}
\end{cases}
\]

Lemma 4. Let $H$ be a history and $t$ be a valid tuple of $E_H$ with $m(A, t, E_H) = M_i$.
Let $\beta(A, E_H, C)$ be a conditional attribute deletion, then

\[
m(A, t, \beta(A, E_H, C)) =
\begin{cases}
M_{i+2} & \text{iff } i \in \{1, 2\} \wedge C(t) \\
M_i & \text{otherwise}
\end{cases}
\]
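
The index arithmetic of Lemmas 3 and 4 can be checked exhaustively against the membership-based definition of the mismatch types. This is a sanity check, not the omitted proof. It assumes that an addition whose condition holds puts the attribute into the intended schema, that a deletion whose condition holds removes it, and that the recorded schema is never touched; the function names are hypothetical.

```python
def after_addition(i, condition_holds):
    """Lemma 3: M_i becomes M_(i-2) iff i in {3, 4} and C(t) holds."""
    return i - 2 if i in (3, 4) and condition_holds else i


def after_deletion(i, condition_holds):
    """Lemma 4: M_i becomes M_(i+2) iff i in {1, 2} and C(t) holds."""
    return i + 2 if i in (1, 2) and condition_holds else i


def mismatch_type(in_recorded, in_intended):
    """Mismatch types of Section 4.1, keyed by schema membership."""
    table = {(True, True): 1, (False, True): 2,
             (True, False): 3, (False, False): 4}
    return table[(in_recorded, in_intended)]


checked = 0
for in_recorded in (True, False):
    for in_intended in (True, False):
        for holds in (True, False):
            i = mismatch_type(in_recorded, in_intended)
            # Addition: A joins the intended schema when C(t) holds.
            assert after_addition(i, holds) == mismatch_type(
                in_recorded, in_intended or holds)
            # Deletion: A leaves the intended schema when C(t) holds.
            assert after_deletion(i, holds) == mismatch_type(
                in_recorded, in_intended and not holds)
            checked += 1
print(checked, "cases agree")
```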



5 Mismatch Resolution 

When querying an evolving database, the DBMS has to systematically resolve attribute 
mismatches. We discuss three sensible and intuitive policies to resolve attribute mis- 
matches at the level of individual attributes (it would be easy to add other policies). 

- Projection: Resolves the mismatch by using the recorded attribute value. Clearly, 
this is only possible if the attribute appears in the recorded schema of the tuple. 
Therefore, projection can only be used to resolve $M_1$ and $M_3$ attribute mismatches. 

- Replacement: Resolves the mismatch by replacing the (missing) attribute value 
with a specified value. 

- Exclusion: Resolves the mismatch by excluding the tuple entirely for the purpose 
of the query. 

To illustrate the resolution policies we provide a series of examples. 

Example 2. Assume we want to count the number of students who got assigned a
supervisor although they are not intended to have one. This means that we need to
count the supervisors of tuples with an $M_3$ mismatch for the supervisor attribute. The
query $\pi[cnt(U)](students)$ together with the policies M1:exclusion, M2:exclusion,
M3:projection, and M4:exclusion answers the query, since only tuples with $M_3$
mismatches are included after resolution of the policies. The result is 0 (cf. Table 1).

Example 3. Assume we want to print all the grades that ever have been assigned. This
means we also want to see the grades of students who got a grade before they
transitioned to the credit system. With the policies M1:projection, M2:exclusion,
M3:projection, and M4:exclusion for Grade, the query $\pi[G](students)$ answers the query.
After resolution only tuples with an actual value of the Grade attribute are included.
The result is {9.2, 8.7, 5.9, 7.1}.

Example 4. Assume we want to count the number of students who are supposed to
have a supervisor but have not got assigned one yet. With the policies M1:exclusion,
M2:replacement, M3:exclusion, and M4:exclusion for supervisor, $\pi[cnt(U)](students)$
answers the query. Only tuples with $M_2$ mismatches are included after resolution of the
policies. Since $M_2$ mismatches indicate that the tuples have no actual value stored, a
replacement policy is used to introduce a value that can be counted by the query. The
result is 1.

Note the generality of our approach. The key advantage of the proposed resolution 
approach is that it decouples schema definition and querying phases. This means that 
the above examples do not depend on the specifics of the database schema. For example, 
the exact reason why someone should (not) have a supervisor does not matter. Queries 
and resolution policies are conceptual solutions that do not depend on the conditional 
schema changes. This is a major difference to approaches that exploit the conditions of 
the schema changes (or other implicit schema information) to formulate special purpose 
queries to answer the example queries. 

6 Mismatch Extended Completed Schema 

In this section we introduce the mismatch extended completed schema (MECS). To sup- 
port conditional schema evolution and policies, the DBMS must be able to determine 
and maintain the recorded and intended schema of tuples. We show that relations with 
MECSs (referred to as evolving instances) can accomplish this task, and give the algo- 
rithms to perform both conditional schema evolution and mismatch resolution. Finally, 
we show that an evolving instance is a lossless evolution model. 

A MECS is a schema $\{A_1, \ldots, A_n, M_1, \ldots, M_n\}$ where for each attribute $A_i$ there
is an attribute $M_i$ recording the attribute mismatch of $A_i$. The domain of $M_i$ indicates
the current type of attribute mismatch of $A_i$: 1 (no mismatch), 2 (not recorded), 3 (not
available), and 4 (not applicable). MECSs are used by evolving instances to record both
the attribute values and the attribute mismatches of tuples:

Definition 2. (evolving instance) Let $H$ be a history and $\{A_1, \ldots, A_n\}$ be the schema
containing all attributes added by schema changes in $H$. Let $I$ be a relation with the
MECS $(A_1, \ldots, A_n, M_1, \ldots, M_n)$ as its schema. If, for all tuples $t \in I$ and all
$i \in \{1, \ldots, n\}$, $M_i$ is the attribute mismatch given by $m(A_i, t, E_H)$, then $I$ is an
evolving instance of $H$.

Table 2 shows the evolving instance for the Student example. 

Table 2. The Evolving Student Instance

N      M      L       G     C    U      M_N  M_M  M_L  M_G  M_C  M_U
john   bio    ugrad   9.2               1    1    1    3    2    4
anne   math   grad    8.7               1    1    1    1    4    2
tom    math   ugrad   5.9               1    1    1    1    4    4
kim    bio    grad    7.1        rick   1    1    1    3    2    1
rita   bio    grad          31   mike   1    1    1    4    1    1

In Section 4 we showed that the attribute mismatches of a tuple encode both the
recorded and the intended schema of that tuple. Moreover, conditional schema changes
can be applied directly to the attribute mismatches. It is therefore sufficient for a DBMS
to maintain the attribute mismatches of tuples instead of their intended and recorded
schemas. This is desirable for two reasons. First, to apply policies the DBMS needs
to determine the attribute mismatches of tuples anyway. Second, while the mismatch
type for an attribute may differ between tuples, each tuple in the same evolving relation
has a mismatch type defined for the exact same set of attributes, namely one for each
attribute added by conditional schema changes. The attribute mismatches of all tuples
can therefore be recorded in a single relation. In contrast, tuples have different recorded
and intended schemas with respect to both number and composition of attributes.

By definition, an evolving instance records the tuples and all their attribute
mismatches. The missing attribute values in Table 2 occur only for attributes with
mismatch type $M_2$ or $M_4$. We require that attribute mismatches are resolved before queries
are answered, and since the projection policy is only applicable to attribute mismatches
of type $M_1$ and $M_3$, the missing attributes in an evolving instance are never directly
accessed, so their content is irrelevant.

6.1 Applying Conditional Schema Changes to Evolving Instances 

Lemmas 3 and 4 from Section 4.2 establish the relationship between conditional schema
changes and attribute mismatches. The operations for conditional attribute addition and
deletion on evolving instances are based on those two lemmas. The main point is that
conditional schema changes do not modify the schema of the evolving instance, but
rather update the attribute values for the mismatch attributes of tuples satisfying the
change condition. Conditional schema changes can therefore be handled by the DBMS
in the same way as standard tuple updates.

Figure 3 gives the formal definitions of conditional addition and deletion of an
attribute $A_i$ for a tuple $t$ in an evolving instance. Both operations assume that the attribute
$A_i$ is in the schema of the evolving instance. Intuitively, deleting an attribute that does
not appear in the schema has no effect. Moreover, if we consider the set of all possible
attributes $\mathbb{A}$, then any schema is a subset of $\mathbb{A}$. The MECS of an evolving instance $I$
contains exactly all the attributes (and their corresponding mismatches) added by
conditional schema changes applied to $I$. Therefore, only attributes in the MECS of $I$ can
appear in the intended schemas of tuples in $I$. The recorded schemas of tuples in $I$ are
similarly bounded. This means that for any attribute $A \in \mathbb{A}$ that does not appear in
the MECS of $I$, the attribute mismatch is $M_4$ for all tuples in $I$. According to Figure 3,
applying an attribute deletion to tuples with an $M_4$ mismatch type for that attribute has
no effect.

\[
\alpha(A_i, t, C) =
\begin{cases}
(t \setminus \{M_i/v_i\}) \cup \{M_i/(v_i - 2)\} & \text{iff } t[M_i] \in \{3, 4\} \wedge C(t) \\
t & \text{otherwise}
\end{cases}
\]
\[
\beta(A_i, t, C) =
\begin{cases}
(t \setminus \{M_i/v_i\}) \cup \{M_i/(v_i + 2)\} & \text{iff } t[M_i] \in \{1, 2\} \wedge C(t) \\
t & \text{otherwise}
\end{cases}
\]

Fig. 3. Attribute Addition and Deletion Operations on Tuples in Evolving Instances

When adding an attribute $A_{n+1}$ that does not already appear in the evolving instance
$I$, we have to extend the MECS of $I$ with that attribute and its corresponding mismatch
attribute $M_{n+1}$, before applying the attribute addition:

\[
\alpha(A_{n+1}, t, C) = \alpha(A_{n+1}, t \cup \{A_{n+1}/\omega, M_{n+1}/4\}, C) \quad \text{iff } A_{n+1}/v_{n+1} \notin t
\]

Note that the mismatch type for the new attribute is $M_4$ (as described in the previous
paragraph) and $\omega$ can be any value in the domain of the attribute (which value is
irrelevant as it will never be used). Operations that add a new column (either empty or with
a default value) to an existing and populated table are already supported by commercial
DBMSs such as Oracle9.
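
The operations of Figure 3, including the extension of the MECS for a new attribute, can be sketched as in-place updates of mismatch values. The tuple layout used here (one dictionary for values and one for mismatch types), the `None` placeholder for the irrelevant value, and all function names are assumptions made for illustration. Conditions are passed as predicates over the stored values.

```python
def new_tuple(**values):
    """A tuple whose recorded attributes all start with mismatch type 1."""
    return {"values": dict(values), "mismatch": {a: 1 for a in values}}


def add_attribute(instance, attr, cond):
    for t in instance:
        # Extend the MECS on first use: placeholder value, mismatch type 4.
        t["values"].setdefault(attr, None)
        t["mismatch"].setdefault(attr, 4)
        if t["mismatch"][attr] in (3, 4) and cond(t["values"]):
            t["mismatch"][attr] -= 2


def delete_attribute(instance, attr, cond):
    for t in instance:
        # An attribute outside the MECS has mismatch type 4: nothing changes.
        if t["mismatch"].get(attr, 4) in (1, 2) and cond(t["values"]):
            t["mismatch"][attr] += 2


students = [
    new_tuple(N="john", M="bio", L="ugrad", G=9.2),
    new_tuple(N="anne", M="math", L="grad", G=8.7),
    new_tuple(N="tom", M="math", L="ugrad", G=5.9),
]
add_attribute(students, "U", lambda v: v.get("L") == "grad")
delete_attribute(students, "G", lambda v: v.get("M") == "bio")
add_attribute(students, "C", lambda v: v.get("M") == "bio")

for t in students:
    print(t["values"]["N"], t["mismatch"])
```

Starting from tuples recorded under the initial Student schema, the three schema changes only rewrite mismatch values, and the results agree with the rows for John, Anne, and Tom in Table 2.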

Theorem 1. An evolving instance is lossless. 



6.2 Mismatch Resolution for Evolving Instances 



The three policies presented in Section 5 are functions that resolve attribute
mismatches at the level of individual attributes.

Let $I$ be an evolving instance and $t$ be a tuple in $I$. Let $A_i$ be an attribute in the
schema of $I$ and $M$ be a mismatch type, then projection, replacement, and exclusion
policies are functions defined as follows:

\[
f_{proj}(t, A_i, M) =
\begin{cases}
\text{undef} & \text{iff } t[M_i] \in \{2, 4\} \\
t & \text{otherwise}
\end{cases}
\]
\[
f^{c}_{repl}(t, A_i, M) =
\begin{cases}
(t \setminus \{A_i/v_i\}) \cup \{A_i/c\} & \text{iff } t[M_i] = M \\
t & \text{otherwise}
\end{cases}
\]
\[
f_{excl}(t, A_i, M) =
\begin{cases}
\emptyset & \text{iff } t[M_i] = M \\
t & \text{otherwise}
\end{cases}
\]

An attribute policy $P = (A, [f_1, \ldots, f_4])$ where $f_i \in \{f_{proj}, f_{repl}, f_{excl}\}$ specifies
a policy for each of the four mismatch types of $A$, i.e., $f_1, \ldots, f_4$ are used to resolve
the attribute mismatches of $A$ of type $M_1, \ldots, M_4$, respectively. Intuitively, an attribute
policy specifies how the DBMS resolves all the attribute mismatches of a given attribute.
We write $P_A$ as a shorthand for $P = (A, [f_1, \ldots, f_4])$.

We can now define the mismatch resolution $\rho[P_A]I$ of an evolving instance $I$ given
attribute policy $P_A$.

Definition 3. (mismatch resolution) Let $I$ be an evolving instance, $A_i$ be an attribute
in the MECS of $I$, $P_{A_i}$ be the attribute policy of $A_i$, and $t$ be a tuple in $I$. Then the
mismatch resolution $\rho$ is given by:

\[
\rho[P_{A_i}]I = \{t' \mid t \in I \wedge M = t[M_i] \wedge t' = f_M(t, A_i, M)\}
\]

Intuitively, for each tuple in the evolving instance the mismatch resolution considers
the mismatch type of attribute $A_i$ and applies the corresponding policy function ($f_1$ for
$M_1$ mismatches, $f_2$ for $M_2$ mismatches, etc.) to derive the resolved tuple $t'$.

To illustrate mismatch resolution and attribute policies, we review the examples 
from Section 5. 

Example 5. Example 2 resolves mismatches with the attribute policy
$P_U = (U, [f_{excl}, f_{excl}, f_{proj}, f_{excl}])$. Table 2 contains no tuples with 3 as their
$M_U$ attribute value, so all tuples are excluded by the mismatch resolution. The result of
$\pi[cnt(U)]I_{student}$ is therefore 0.

Example 6. Example 3 uses an attribute policy on G:
$P_G = (G, [f_{proj}, f_{excl}, f_{proj}, f_{excl}])$, excluding tuples with $M_2$ and $M_4$ mismatches
for attribute G. Mismatch resolution results in Table 3. The answer to the query
$\pi[G]I_{student}$ is {9.2, 8.7, 5.9, 7.1}.

Table 3. Mismatch Resolution for Example 3

N      M      L       G     C    U      M_N  M_M  M_L  M_G  M_C  M_U
john   bio    ugrad   9.2               1    1    1    3    2    4
anne   math   grad    8.7               1    1    1    1    4    2
tom    math   ugrad   5.9               1    1    1    1    4    4
kim    bio    grad    7.1        rick   1    1    1    3    2    1



Example 7. Example 4 uses the same query as Example 2, but resolves mismatches
according to a different attribute policy:
$P_U = (U, [f_{excl}, f^{jd}_{repl}, f_{excl}, f_{excl}])$. The result is shown in Table 4. The policy
has replaced the missing sUpervisor attribute value with jd (John Doe) for all tuples with
$M_2$ mismatches, and excluded all other tuples. The result of the query is 1.



Table 4. Mismatch Resolution for Example 4

N      M      L      G     C    U    M_N  M_M  M_L  M_G  M_C  M_U
anne   math   grad   8.7        jd   1    1    1    1    4    2
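
Examples 5 to 7 can be replayed with a small sketch of the three policy functions and of mismatch resolution. It rests on assumptions that are not in the paper: tuples are dictionaries, the mismatch attribute of A is stored under the key `"M_" + A`, missing values are `None`, excluded tuples are represented by `None`, and an undefined projection raises an error. The policy for a mismatch type is selected by list position instead of being passed the type M, which has the same effect as Definition 3.

```python
# Rows of Table 2; the mismatch columns of N, M, and L (all 1) are omitted.
STUDENTS = [
    {"N": "john", "G": 9.2, "C": None, "U": None, "M_G": 3, "M_C": 2, "M_U": 4},
    {"N": "anne", "G": 8.7, "C": None, "U": None, "M_G": 1, "M_C": 4, "M_U": 2},
    {"N": "tom", "G": 5.9, "C": None, "U": None, "M_G": 1, "M_C": 4, "M_U": 4},
    {"N": "kim", "G": 7.1, "C": None, "U": "rick", "M_G": 3, "M_C": 2, "M_U": 1},
    {"N": "rita", "G": None, "C": 31, "U": "mike", "M_G": 4, "M_C": 1, "M_U": 1},
]


def project(t, attr):
    """f_proj: keep the recorded value; undefined for types 2 and 4."""
    if t["M_" + attr] in (2, 4):
        raise ValueError("projection needs a recorded value")
    return t


def replace_with(c):
    """f_repl with replacement value c."""
    return lambda t, attr: {**t, attr: c}


def exclude(t, attr):
    """f_excl: drop the tuple."""
    return None


def resolve(instance, attr, policies):
    """rho[P_A] I, where policies[k - 1] resolves mismatch type k of attr."""
    resolved = (policies[t["M_" + attr] - 1](t, attr) for t in instance)
    return [t for t in resolved if t is not None]


example5 = resolve(STUDENTS, "U", [exclude, exclude, project, exclude])
example6 = resolve(STUDENTS, "G", [project, exclude, project, exclude])
example7 = resolve(STUDENTS, "U", [exclude, replace_with("jd"), exclude, exclude])

print(len(example5))                         # 0
print([t["G"] for t in example6])            # [9.2, 8.7, 5.9, 7.1]
print([(t["N"], t["U"]) for t in example7])  # [('anne', 'jd')]
```

Each call makes a single pass over the instance, so resolving one attribute policy takes time linear in the number of tuples.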



Mismatch resolution can apply any number of attribute policies (on different
attributes) to an evolving instance at the same time (cf. Lemma 5).

Lemma 5. Let $P_{A_1}, \ldots, P_{A_m}$ be policies on different attributes. Let $I$ be an evolving
instance and $t$ be a tuple in $I$. Then:

\[
\rho[P_{A_1}] \cdots \rho[P_{A_m}] I = \{t' \mid t \in I \wedge t' = \rho_{A_1} \times \cdots \times \rho_{A_m}(\{t\})\}
\]

Lemma 5 also establishes the upper bound on time complexity of mismatch resolution
of evolving instances to be proportional to the size of the evolving instance, i.e.,
$O(|I|)$.

For query purposes, this corresponds to resolving all attribute mismatches and then
applying the query to the resolved relation. For applications with fixed policies, a view can
be created for the resolved relation. However, for applications using ad hoc attribute
policies, we want to minimize the number of attributes and tuples that have to be
resolved in order to answer a given query.



7 Related Work 

Conditional schema evolution has been investigated in the context of temporal databases, 
where proposals have been made for the maintenance of schema versions along 
one [13, 17, 20] or more time dimensions [6]. Because schema change conditions are 
restricted to one or two time recording attributes, the exponential explosion of the number 
of schema segments is avoided. 

In temporal databases schema evolution has been analyzed in the context of tempo- 
ral data models [7, 1], and schema changes are applied to explicitly specified versions 
[22,5]. This requires an extension to the query language and forces schema semantics 
(such as missing or inapplicable information) down into attribute values [18, 14]. In or- 
der to preserve the convenience of a single global schema for each relation null values 
have been used [14, 8]. In particular, it is possible to use inapplicable nulls if an attribute 
does not apply to a specific tuple [2,3, 11, 12]. This leads to completed schemas [18] 
with an enriched semantics for null values. The approach does not scale to an ordered 
sequence of conditional evolution steps with multiple current schemas. It is also insufficient 
for attribute deletions if we do not want to overwrite (and thus lose) attribute 
values. In response to this it has been proposed to activate and deactivate attributes 
rather than to delete them [19]. 

Unconditional schema evolution has also been investigated in the context of OODBs, 
where several systems have been proposed. Orion [4], CLOSQL [15], and Encore [21] 
all use a versioning approach. Typically, a new version of the object instances is con- 
structed along with a new version of the schema. The Orion schema versioning mech- 
anism keeps versions of the whole schema hierarchy instead of the individual classes 
or types. Every object instance of an old schema can be copied or converted to become 
an instance of the new schema. The class versioning approach CLOSQL provides up- 
date/backdate functions for each attribute in a class to convert instances from the format 
in which the instance is recorded to the format required by the application. The Encore 
system provides exception handlers for old types to deal with new attributes that are 
missing from the instances. This allows new applications to access undefined fields of 
legacy instances. In general, the versioning approach for unconditional schema changes 
cannot be applied to conditional schema changes, because the number of versions that 
has to be constructed grows exponentially. 

Views have been proposed as an approach to schema evolution in OODBs [23]. 
Ra and Rundensteiner [16] propose the Transparent Schema Evolution approach, where 
schema changes are specified on a view schema rather than the underlying global schema. 
They provide algorithms to compute the new view that reflects the semantics of the schema change. 
The approach allows for schema changes to be applied to a single view without affecting 
other views, and for the sharing of persistent data used by different views. 

8 Summary and Future Research 

The paper defines the lossless property for general models of evolving schemas, and 
shows that the solutions offered by current commercial database systems are not lossless. 
The main problem is to resolve the mismatches between the recorded and intended 
schemas of tuples systematically. We propose a parametric approach where mismatches 
are resolved according to the needs of the query. We propose the mismatch extended 
completed schema (MECS) which records both attributes and their mismatches. We 
show that relations with MECS (called evolving instances) are lossless, and by exploit- 
ing the relationship between attribute mismatches and conditional schema changes we 
can treat conditional schema changes as tuple updates. Finally, we establish an upper 
bound on the cost of parametric mismatch resolution of evolving instances. 

Ongoing and future work includes the optimization of mismatch resolution. Specif- 
ically, we investigate techniques to minimize the amount of data that has to be resolved. 
By defining mismatch resolution as an algebraic operator, we are currently developing 
algebraic transformation rules to be used by the query optimizer. Preliminary results 
indicate that mismatch resolution can be delayed until the recorded attribute values are 
required by other operators, which can substantially reduce the cost of resolution. 



References 

1. G. Ariav. Temporally Oriented Data Definitions: Managing Schema Evolution in Temporally 
Oriented Databases. Data & Knowledge Engineering, 6(6):451-467, 1991. 

2. P. Atzeni and V. de Antonellis. Relational Database Theory. Benjamin/Cummings, 1993. 

3. P. Atzeni and N. M. Morfuni. Functional dependencies in relations with null values. 
Information Processing Letters, 18(4):233-238, 1984. 

4. J. Banerjee, W. Kim, H.-J. Kim, and H.F. Korth. Semantics and Implementation of Schema 
Evolution in Object-Oriented Databases. In ACM SIGMOD International Conference on 
Management of Data, pages 311-322. ACM Press, 1987. 

5. C.D. Castro, F. Grandi, and M.R. Scalas. On Schema Versioning in Temporal Databases. In: 
Recent Advances in Temporal Databases. Springer, 1995. 

6. C.D. Castro, F. Grandi, and M.R. Scalas. Schema Versioning for Multitemporal Relational 
Databases. Information Systems, 22(5):249-290, 1997. 

7. J. Clifford and A. Croker. The Historical Relational Data Model (HRDM) and Algebra based 
on Lifespans. In 3rd International Conference of Data Engineering, Los Angeles, California, 
USA, Proceedings, pages 528-537. IEEE Computer Society Press, 1987. 

8. G. Grahne. The Problem of Incomplete Information in Relational Databases. In Springer 
LNCS No. 554, 1991. 

9. O. G. Jensen and M. H. Bohlen. Evolving Relations. In Database Schema Evolution and 
Meta-Modeling, volume 9th International Workshop on Foundations of Models and 
Languages for Data and Objects of Springer LNCS 2065, page 115 ff., 2001. 

10. O.G. Jensen. Multi-Dimensional Conditional Schema Evolution in Relational Databases. 
PhD thesis, Aalborg University, 2004. 







11. A. M. Keller. Set-theoretic problems of null completion in relational databases. Information 
Processing Letters, 22(5):261-265, 1986. 

12. N. Lerat and W. Lipski. Nonapplicable Nulls. Theoretical Computer Science, 46:67-82, 
1986. 

13. L.E. McKenzie and R.T. Snodgrass. Schema Evolution and the Relational Algebra. 
Information Systems, 15(2):207-232, 1990. 

14. R. van der Meyden. Logical Approaches to Incomplete Information: a Survey. In: Logics for 
Databases and Information Systems (chapter 10). Kluwer Academic Publishers, 1998. 

15. Simon R. Monk and Ian Sommerville. Schema Evolution in OODBs using Class Versioning. 
SIGMOD Record, 22(3):16-22, 1993. 

16. Young-Gook Ra and Elke A. Rundensteiner. A transparent object-oriented schema change 
approach using view evolution. In Philip S. Yu and Arbee L. P. Chen, editors, Proceedings 
of the Eleventh International Conference on Data Engineering, March 6-10, 1995, Taipei, 
Taiwan, pages 165-172. IEEE Computer Society, 1995. 

17. J.F. Roddick. SQL/SE - A Query Language Extension for Databases Supporting Schema 
Evolution. ACM SIGMOD Record, 21(3):10-16, 1992. 

18. J.F. Roddick. A Survey of Schema Versioning Issues for Database Systems. Information 
and Software Technology, 37(7):383-393, 1995. 

19. J.F. Roddick, N.G. Craske, and T.J. Richards. A Taxonomy for Schema Versioning based on 
the Relational and Entity Relationship Models. In 12th International Conference on 
Entity-Relationship Approach, Arlington, Texas, USA, December 15-17, 1993, Proceedings, 
pages 137-148. Springer-Verlag, 1993. 

20. J.F. Roddick and R.T. Snodgrass. Schema Versioning. In: The TSQL92 Temporal Query 
Language. Norwell, MA: Kluwer Academic Publishers, 1995. 

21. Andrea H. Skarra and Stanley B. Zdonik. The Management of Changing Types in an 
Object-Oriented Database. In OOPSLA, 1986, Portland, Oregon, Proceedings, pages 483-495, 
1986. 

22. R.T. Snodgrass et al. TSQL2 Language Specification. ACM SIGMOD Record, 23(1), 1994. 

23. Markus Tresch and Marc H. Scholl. Schema transformation without database reorganization. 
SIGMOD Record, 22(1):21-27, 1993. 




Ontology-Guided Change Detection 
to the Semantic Web Data 



Li Qin and Vijayalakshmi Atluri 

MSIS Dept. and Center for Information Management, Integration and Connectivity (CIMIC) 

Rutgers University 

{liqin, atluri}@cimic.rutgers.edu 



Abstract. The Semantic Web is envisioned as the next generation web in which 
data instances are enriched with metadata defined in ontologies to describe the 
meaning of its instances. In this paper, we present an approach that exploits on- 
tologies in guiding the change detection to their data instances. Inference rules 
are identified based on the semantic relationships among concepts, properties 
and instances as well as their change behaviors. Starting with changes to some 
seed instances, a reasoning engine is designed to fire the pre-defined rule set 
and act on ontologies to project some semantically associated concepts as target 
concepts. Certain instances of these target concepts are further selected as target 
instances, which have a high likelihood of having changed. Our approach is 
specifically oriented toward the Semantic Web, thus it has intelligence to ex- 
ploit the semantic associations among data instances and make smart decisions. 



1 Introduction 

Change detection is to find out whether and what changes have occurred to data of 
interest, especially those owned and updated by autonomous sources. For example, a 
search engine has to detect changes to data published by autonomous sources in order 
to synchronize its local copies and its index with their sources. The major challenge 
confronting change detection lies in the conflict between limited availability of re- 
sources for change detection and the enormity of the data available. As an extension 
of WWW, the Semantic Web [2] will continue to be decentralized with its informa- 
tion space projected to increase at a much faster pace than resource availability. 
Therefore, change detection as well as management of versions and changes will be 
essential to data warehouses, search engines, cache maintenance and knowledge ar- 
chival applications for the Semantic Web. Earlier approaches on change detection to 
web pages rely on the link structure among web pages or statistics estimated offline 
such as their change frequencies [7,11]. In this paper, we present an approach that 
exploits the ontologies, in particular, the relationships among concepts, in guiding the 
change detection to their data instances. 

Under the Semantic Web, data instances are enriched with metadata, which de- 
scribes the meaning of its instances. The metadata used for annotating data instances 
are defined as concepts and properties in ontologies so that data instances can be in- 
terpreted and processed by machines. Semantic Web technologies, e.g. RDF [12], 
RDF Schema [13] and OWL [9], provide methods and standards that enable abstract- 
ing from syntactic idiosyncrasies into semantically meaningful description of data and 
services, accurate access to information as well as flexibility to comply with the needs 
of users or agents. It has promised flexibility, scalability and quality in data and ser- 
vice provision, which the current web cannot possibly achieve. 

The Semantic Web will no longer be about pages and links, but semantic relationships 
between things. Under this context, we present an approach that exploits ontologies 
in guiding the change detection to their data instances. Our semantics-based 
approach gives higher priority to data instances related to the instances to which 
changes have been recently found, as it believes those semantically related data are 
more likely to have changed as the result of the efforts by the same or even different 
information sources to maintain freshness and consistency. An inference engine is 
designed to reason on the basis of some detected changes and ontologies, and to make 
intelligent decisions on what to visit next so that changes to the Semantic Web data 
can be detected in a more efficient way. To this end, the inference engine needs to 
take advantage of inference rules, which are identified based on the ontologies and 
specified among concepts, properties, and instances of concepts as well as their 
changes. Given changes detected to seed instances, these rules are essentially used by 
the reasoning engine to generate a profile for locating the target instances that are 
likely to have changed and therefore should be visited. The profile for the target in- 
stances also contains target concepts that target instances belong to, and target proper- 
ties that target instances should or should not instantiate based on the change type. 

With our ontology-guided approach, not only more changes may be detected, but 
also these changes are more semantically related. Changes detected to related, but 
independent sources may reveal something interesting not discovered by observing 
each change separately. For example, if consistent changes are witnessed to multiple 
independent sources, this may increase the trustworthiness of the detected changes so 
that they can be trusted to identify more changes. Change detection can ultimately 
become a focused, well-controlled process in the sense that, instead of visiting pages 
in a blind way, it targets more accurately the pages it wants to visit, e.g. those related 
to a specific topic. Besides, techniques for our semantics-based approach can provide 
insights into utilizing semantic association in other applications such as guiding in- 
formation discovery for agents, consistency maintenance among distributed informa- 
tion sources, and so on. 

In summary, our work is an attempt to design intelligent change detection tech- 
niques guided by ontologies, combining inference and search for target instances 
under the infrastructure of the Semantic Web. Other related work on change detection 
includes studies on the change dynamics of web pages [3,5,6] and diff algorithms 
developed to accommodate data formats such as HTML and XML [4,8,14]. Compared 
to the earlier proposals, our approach is specifically oriented toward the Semantic 
Web, thus it has intelligence to exploit the semantic associations among data instances 
and make smart decisions in an ad-hoc manner with no assumptions about the 
contents of data or changes. 

This paper is organized as follows. Section 2 is an overview of our approach. We 
present preliminaries in section 3, and elaborate the types of changes to the Semantic 
Web data and profiles in section 4. Sections 5 and 6 present inference rules along with 
our intelligent change detection. Section 7 provides conclusions and our future work. 

2 Overview of the Proposed Approach 

Our approach begins with identifying different types of inference rules based on cer- 
tain semantic relationships in ontologies and these rules are exploited by the reason- 
ing engine in guiding the change detection process. Our change detection approach is 
ontology-guided since all these rules act based on the ontologies. Essentially, the 
change detection process takes a set of seed instances to which changes have been 
detected, and derives a set of target instances which should also have changed. We 
identify five categories of inference rules: (1) Change Inference Rules: Given 
changes to seed instances, these rules imply changes to their semantically related 
instances. (2) Profile Inference Rules: Once any change inference rule is fired, i.e. 
changes to certain semantically related instances are implied, the corresponding 
profile inference rule is used to derive a profile for these instances with descriptions 
of target concepts possibly in terms of eq[c] and sub[c], target instances in terms of 
eq[i] and da[i], and target properties in terms of eq[p] and sup[p] (these operators are 
formally introduced in section 5). (3) Concept Inference Rules: If the description of 
target concepts contains eq[c] or sub[c], concept inference rules are called to derive 
the specific target concepts. (4) Property Inference Rules: If the description of target 
properties contains eq[p] or sup[p], property inference rules are called to derive the 
specific target properties. (5) Instance Inference Rules: If the description of target 
instances contains eq[i] or da[i], instance inference rules are called to derive the spe- 
cific target instances. 




Fig. 1. Information used by the reasoning engine 



Fig. 1 shows the components involved in the reasoning. It starts with some seed instances 
and finds changes to them. These changes become input to the change inference 
rules, as indicated by step 1 in Fig. 1. Assume that the webpage for the MSIS 
Dept. of Rutgers, which is an instance of concept ‘Department’, is visited and the 
‘address’ of the department is found updated. Given these change(s), the reasoning 
engine comes into play by checking the change inference rules and fires those relevant 
to the detected change(s) and the ontologies, shown as step 2 in Fig. 1. Let us 
assume one of the change inference rules is stated as follows: “If dependency¹ exists 
from any property of the semantically associated concepts to this ‘address’ property, 
then instances specifically associated with this department instance should have that 
property value changed.” Then, the reasoning engine visits the ontology this department 
instance points to, and finds that the ‘address’ property is defined for concept 
‘Academic Unit’, of which concept ‘Department’ is defined as a specialization; that concept 
‘Employee’ associates itself with ‘Academic Unit’ through object property ‘worksFor’; 
and that a dependency exists from the ‘business address’ property of ‘Employee’ to 
the ‘address’ property of ‘Academic Unit’. This means that if the ‘address’ property of an 
instance of ‘Department’ has changed, then the value of the ‘business address’ of all 
the instances of ‘Employee’ directly associated with this ‘Department’ instance 
should also have changed. As a result, the above change inference rule is fired. Since 
the fired change inference rule implies possible changes to some semantically related 
instances, the corresponding profile inference rules are triggered, indicated by step 3 
in Fig. 1. In our example, the profile inference rule for semantically related instances 
by dependency is triggered. Shown as step 4 in Fig. 1, the profile inference rules generate 
a profile for the target instances, which have a high likelihood of having been 
changed and therefore should be visited next. This profile consists of target concepts, 
target instances and target properties, with each set described by certain operators we 
define. In our example, if concept ‘Employee’ has subclass concepts ‘Faculty’, 
‘Staff’ and ‘Ph.D Student’, with each possibly having its own subclass concepts, then 
the target concepts will contain the operator for subclass concepts as sub(Employee). 
The operators for concepts, instances and properties in the profile will call concept 
inference rules, instance inference rules and property inference rules, respectively, 
to derive their specific elements. For our example, the concept inference rules 
will be called to derive the specific concepts constituting sub(Employee), which includes 
concepts ‘Faculty’, ‘Staff’ and ‘Ph.D Student’ as well as their subclass concepts, if 
any exist. After that, the reasoning engine finally derives a profile for the target instances 
to be the instances of concepts in sub(Employee) that are directly associated with 
the MSIS Department by taking it as the value of their ‘worksFor’ property and have their 
‘business address’ instantiated. The profile is used to locate the actual target instances 
if the profile contains or is satisfiable by certain instances, shown as step 5 in Fig. 1. 
The target data instances for our example may be located in the personal web pages of 
the department’s faculty members, the faculty list page on the department’s web site, 
and so on. 

¹ If a dependency exists from property p_j of concept c_j to p_i of concept c_i, where c_i and c_j are 
directly associated, then when the value of p_i of an instance of c_i has changed, the value of p_j 
of the directly associated instances of c_j should also have changed. 
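
The five steps of the department example can be traced in a small Python sketch. It is only a toy under stated assumptions: the subclass table, the single dependency record, the `ex:` instance identifiers and the function names (`ancestors`, `sub`, `derive_profiles`, `locate`) are all hypothetical, and the target concepts are taken to be ‘Employee’ together with sub(Employee).

```python
# Hypothetical fragment of the university ontology (child concept -> parent concept).
SUBCLASS_OF = {
    "Department": "Academic Unit",
    "Faculty": "Employee",
    "Staff": "Employee",
    "Ph.D Student": "Employee",
}

# (dependent concept, dependent property, object property, source concept, source property)
DEPENDENCIES = [("Employee", "business address", "worksFor", "Academic Unit", "address")]

# Hypothetical instances; "ex:" identifiers stand in for real URIs.
INSTANCES = [
    {"uri": "ex:msis", "concept": "Department", "props": {"address": "new address"}},
    {"uri": "ex:alice", "concept": "Faculty",
     "props": {"worksFor": "ex:msis", "business address": "old address"}},
    {"uri": "ex:bob", "concept": "Staff", "props": {"worksFor": "ex:msis"}},
    {"uri": "ex:carol", "concept": "Faculty",
     "props": {"worksFor": "ex:cs", "business address": "elsewhere"}},
]


def ancestors(concept):
    """The concept itself plus every concept it specializes."""
    result = {concept}
    while concept in SUBCLASS_OF:
        concept = SUBCLASS_OF[concept]
        result.add(concept)
    return result


def sub(concept):
    """Concept inference: all direct and indirect subclass concepts."""
    return {c for c in SUBCLASS_OF if c != concept and concept in ancestors(c)}


def derive_profiles(change):
    """Change and profile inference: fire every dependency matching the change."""
    uri, concept, prop = change
    profiles = []
    for dep_c, dep_p, via, src_c, src_p in DEPENDENCIES:
        if src_p == prop and src_c in ancestors(concept):
            profiles.append({"TC": {dep_c} | sub(dep_c), "TP": {dep_p},
                             "via": via, "seed": uri})
    return profiles


def locate(profile):
    """Select instances of a target concept that point at the seed instance
    and instantiate every target property."""
    return [i["uri"] for i in INSTANCES
            if i["concept"] in profile["TC"]
            and i["props"].get(profile["via"]) == profile["seed"]
            and profile["TP"] <= set(i["props"])]


profiles = derive_profiles(("ex:msis", "Department", "address"))
targets = [uri for profile in profiles for uri in locate(profile)]
print(targets)
```

In this toy data only `ex:alice` qualifies: `ex:bob` has no ‘business address’ instantiated and `ex:carol` works for a different unit.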

3 Preliminaries 

Ontologies: "An ontology defines the terms used to describe and represent an area of 
knowledge." [10] Generally, an ontology defined for a domain contains a description 
of important concepts in the domain, properties of each concept as well as restrictions 
or axioms upon properties. Fig. 2 is an example of an ontology (the part above the 
horizontal line), available at http://cimic.rutgers.edu/ontologies/university, and some 
instances (the part below the horizontal line). The dashed lines across these two parts 
indicate the correspondence between the instances and ontologies. 

An ontology o_i consists of the following elements: (1) Concepts: The set of concepts 
C[o_i] = {c_1, c_2, ..., c_n}. C[http://cimic.rutgers.edu/ontologies/university] = {Student, 
Ph.D Student, Course, Faculty} for the ontology in Fig. 2. (2) Properties: ∀ c ∈ C[o_i], 
there exists a set of properties P[c] = DP[c] ∪ OP[c] where DP[c] = {dp_1, dp_2, ..., dp_m} 
are datatype properties of concept c, each taking a primitive data type as the value, 
and OP[c] = {op_1, op_2, ..., op_n} are object properties, each taking some concept(s) as 
the value. We use dp_j[c] to represent the value taken by the datatype property dp_j of 
concept c, and op_j[c] to denote the concept(s) taken by object property op_j of concept 
c. In Fig. 2, P[Ph.D Student] = {subClassOf, quaExamDate, advisedBy}, with 
DP[Ph.D Student] = {quaExamDate} where quaExamDate[Ph.D Student] = xsd:date 
and OP[Ph.D Student] = {subClassOf, advisedBy}. Object properties can be domain-independent 
or domain-specific. While the domain-independent object properties 
have pre-defined meaning that does not vary from one ontology to another, the meaning 
of domain-specific object properties depends on the context in which it is defined. 
We use OP_DI[c] and OP_DS[c] to denote the domain-independent and domain-specific 
object properties of concept c, respectively. For example, OP_DI[Ph.D Student] = 
{subClassOf} where subClassOf[Ph.D Student] = Student and OP_DS[Ph.D Student] = 
{advisedBy} where advisedBy[Ph.D Student] = Faculty. (3) Restrictions: For all 
p ∈ DP[c] or p ∈ OP_DS[c] where c ∈ C[o_i], there exists a set of restrictions on the value 
or cardinality of the property, denoted by R[p] = {r_1, ..., r_w} where r_k[p] is the value to 
the restriction r_k of property p. For example, R[quaExamDate] = {maxCardinality} 
where maxCardinality[quaExamDate] = 2. (4) Axioms: For all p ∈ DP[c] or 
p ∈ OP_DS[c] where c ∈ C[o_i], there exists a set of axioms with each defined by itself 
(unary) or in relation to another property (binary), denoted by A[p] = {a_1, a_2, ..., a_n}. If a_i 
is binary, a_i[p] denotes the property related to p through axiom a_i. Fig. 2 shows that 
inverseOf[advisedBy] = supervises, where supervises[Faculty] = Ph.D Student. 
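
One possible in-memory representation of these four elements is sketched below in Python. The `Concept` class and the `RESTRICTIONS` and `AXIOMS` tables are hypothetical names chosen for the sketch; the values are the ones quoted above for ‘Ph.D Student’ and ‘Faculty’.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str
    dp: dict = field(default_factory=dict)     # DP[c]: datatype property -> primitive type
    op_di: dict = field(default_factory=dict)  # OP_DI[c]: domain-independent object property -> concept
    op_ds: dict = field(default_factory=dict)  # OP_DS[c]: domain-specific object property -> concept

    def properties(self):
        """P[c] = DP[c] united with OP[c]."""
        return set(self.dp) | set(self.op_di) | set(self.op_ds)


phd = Concept(
    "Ph.D Student",
    dp={"quaExamDate": "xsd:date"},
    op_di={"subClassOf": "Student"},
    op_ds={"advisedBy": "Faculty"},
)
faculty = Concept("Faculty", op_ds={"supervises": "Ph.D Student"})

# R[p]: restriction name -> value, and A[p]: axiom name -> related property.
RESTRICTIONS = {"quaExamDate": {"maxCardinality": 2}}
AXIOMS = {"advisedBy": {"inverseOf": "supervises"}}

print(sorted(phd.properties()))
```

Keeping restrictions and axioms in tables keyed by property name mirrors the R[p] and A[p] notation; a fuller model would also have to scope property names by ontology.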




Fig. 2. Example for ontology and instances 



Different ontology languages may support different types of domain-independent 
object properties, restrictions and axioms upon properties. For instance, OWL supports 
domain-independent object properties such as subClassOf, equivalentClass, 
restrictions such as cardinality, and axioms such as subPropertyOf, equivalentProperty, 
TransitiveProperty, SymmetricProperty, FunctionalProperty, InverseFunctionalProperty, 
inverseOf, and so on. Though our approach is not limited to any specific 
ontology language, we will resort to the OWL vocabulary [9] to simplify our discussion. 

Instances: Ontologies provide interpretations to the content of the Semantic Web data 
since ontologies define and relate concepts used to annotate the web data, which are 
instances to the concepts in ontologies. Different elements of ontologies serve different 
functions: concepts along with their datatype properties and domain-specific object 
properties are what can be instantiated by instances, restrictions upon properties 
specify what requirements instantiation should satisfy in order to be valid, and do- 
main-independent object properties and the axioms of properties provide powerful 
mechanisms for enhanced reasoning about instances. A semantic document contains 
instances of different concepts with some or all of the properties instantiated, and 
pointers to the ontologies where concepts used for annotation are defined. 

In ontologies, a property is only specified to the most general concept to which it 
applies; the subclass concepts of this general concept can inherit all of its properties. 
Therefore, the instances of these subclass concepts can instantiate these inherited 
properties as well as their own. For a concept c_i, we use P'[c_i] to represent all the 
properties that instances of concept c_i can instantiate. Therefore, P'[c_i] = DP[c_i] ∪ 
OP_DS[c_i] ∪ DP[c_j] ∪ OP_DS[c_j] for any c_j where c_i ∈ sub[c_j] or c_i ∈ eq[c_j]. Similarly, 
DP'[c_i] and OP'_DS[c_i] represent the set of datatype properties and domain-specific 
object properties that instance i can instantiate where i ∈ I[c_i], respectively. R'[p] will 
be the set of restrictions on property p. Based on Fig. 2, since subClassOf[Ph.D Student] = Student, 
P'[Ph.D Student] = {quaExamDate, advisedBy, ID, registers}. 
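
The computation of P'[c] amounts to walking up the subclass chain and collecting properties. The Python sketch below assumes single inheritance and no equivalent concepts; the `OWN` table is hypothetical, with ‘Student’ given the properties ID and registers because those are the inherited ones in the stated result.

```python
SUBCLASS_OF = {"Ph.D Student": "Student"}

# DP[c] united with OP_DS[c] for each concept (own properties only).
OWN = {
    "Student": {"ID", "registers"},
    "Ph.D Student": {"quaExamDate", "advisedBy"},
}


def instantiable(concept):
    """P'[c]: own properties plus those inherited from every superclass."""
    props = set()
    while concept is not None:
        props |= OWN.get(concept, set())
        concept = SUBCLASS_OF.get(concept)
    return props


print(sorted(instantiable("Ph.D Student")))
```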

If c_i is a subclass of c_j, then instances of c_i are also instances of c_j. To be clear, we 
use the following notation: I[c] for the instances which are asserted to belong to concept 
c, and i ∈ I[c] only if an instance i is asserted to belong to a concept c; I'[c] to 
refer to all the instances of concept c by explicit assertion and by inheritance, in 
which case, I'[c] = I[c] ∪ I[c_j] for any c_j, where c_j ∈ sub[c] or c_j ∈ eq[c]. For instance, 
subClassOf[Ph.D Student] = Student and ‘http://www.rutgers.edu/~amy#amy’ ∈ I[Ph.D 
Student]. Therefore, ‘http://www.rutgers.edu/~amy#amy’ ∈ I'[Student]. For an instance 
i, we use C[i] to represent the set of concepts c where i ∈ I'[c]. Therefore, 
C[‘http://www.rutgers.edu/~amy#amy’] = {Student, Ph.D Student}. 
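
The difference between I[c] and I'[c] can be made concrete with a short Python sketch. It assumes single inheritance and ignores equivalent concepts; `ASSERTED`, `concepts_of` and `all_instances` are hypothetical names.

```python
SUBCLASS_OF = {"Ph.D Student": "Student"}
AMY = "http://www.rutgers.edu/~amy#amy"

# I[c]: instances explicitly asserted to belong to each concept.
ASSERTED = {"Ph.D Student": {AMY}, "Student": set()}


def concepts_of(uri):
    """C[i]: every concept the instance belongs to, by assertion or inheritance."""
    result = set()
    for concept, members in ASSERTED.items():
        if uri in members:
            while concept is not None:
                result.add(concept)
                concept = SUBCLASS_OF.get(concept)
    return result


def all_instances(concept):
    """I'[c]: instances of the concept by assertion or inheritance."""
    return {uri for members in ASSERTED.values() for uri in members
            if concept in concepts_of(uri)}


print(sorted(concepts_of(AMY)))
```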

Each concept c_i has a set of instances I[c_i] = {i_1, ..., i_n} and ∀ i ∈ I[c_i], the description 
of i may include: (1) URI: A Uniform Resource Identifier (URI) by which i can 
be universally identified and other instances can refer to it. (2) Concept: c_i, the concept 
that i is asserted to instantiate. (3) Datatype property instantiation: 
∀ dp ∈ DP'[c_i], dp takes a specific value v, denoted as i:dp = v, where the instantiation 
is valid if R'[dp] are satisfied. (4) Object property instantiation: ∀ op ∈ OP'_DS[c_i] 
and op[c_i] = c_j, op takes an instance i_j as its value, denoted as i:op = i_j where i_j ∈ I'[c_j]. 
This instantiation is valid if R'[op] are satisfied. Note that the instantiation of an object 
property of a concept taking another concept as its value represents a mapping 
(direct association) between the instances belonging to these two concepts. 

We use P[i], DP[i] and OP[i] to denote the set of properties, datatype properties 
and object properties instantiated by instance i. If i ∈ I[c], p ∈ P[i], then p ∈ P'[c]. Also, 
if i ∈ I[c], then P[i] ⊆ P'[c], DP[i] ⊆ DP'[c], OP[i] ⊆ OP'[c]. Take the instance of concept 
‘Ph.D Student’ in Fig. 2 as an example. The URI for this instance is 
http://www.rutgers.edu/~amy#amy and the concept it belongs to is ‘Ph.D Student’. 
P[‘http://www.rutgers.edu/~amy#amy’] = {ID, quaExamDate, registers}. 

Relationships: Given two concepts c_i and c_j, we say c_i and c_j are directly associated 
through op_i if op_i[c_i] = c_j or op_i[c_j] = c_i. We use da[c_i] to denote the set of concepts that 
are directly associated with concept c_i and da[c_i:op_i] to denote the concept directly 
associated with concept c_i through object property op_i. For example, da[Ph.D Student] = 
{Student, Faculty} where da[Ph.D Student:subClassOf] = Student. 

The relationship between two instances can be directly associated, indirectly asso- 
ciated or not associated. We will first discuss equivalence between instances as a 
special category, whose transitivity allows it to be either directly associated or indi- 
rectly associated. In this paper, we focus only on direct association between instances 
because our change and profile inference rules are mainly specified upon directly 
associated instances. The relationship between instances covers both explicit and 
implicit relationships. By explicit relationship, we mean the relationship between the 
instances is asserted explicitly by content creators while the implicit relationship be- 
tween two instances can be derived through reasoning. The equivalence between 
instances we discuss below is implicit, if it is evaluated based on the value of its iden- 
tification property. 

Each instance should be given a URI by its owner for identifying the instance and 
for other instances to refer to it. URIs are decentralized and “Two URIs are different 
unless they are the same character for character.” [1] However, different URIs may be 
equivalent if they refer to the same real world object. We notice that equivalent in- 
stances published in different sources may be matched based on the value of some 
identification or quasi-identification property (or a combination of multiple proper- 
ties), similar to the primary key in a relational database table. Sometimes, instances of 
different concepts may be equivalent. For instance, the same international Ph.D stu- 
dent may be instantiated as an instance of concepts such as ‘Student’, ‘International 
Student’, ‘Graduate Student’ or ‘Ph.D Student’. In particular, if concepts share a 
property defined to be owl:InverseFunctionalProperty, then instances that have the 
same value for this property are equivalent. In other words, a property, whose axiom 
defines it to be an owl:InverseFunctionalProperty, is an identification property. Besides, 
equivalence between instances can also be explicitly indicated using 
owl:sameIndividualAs. 

Equivalence between instances: Two instances are equivalent if they refer to the 
same entity in the real world. This definition will be the foundation for the instance 
inference rule on equivalent instances discussed in section 5. We use eq[i_m] for the set 
of equivalent instances of i_m with i_m itself included. If instances i_m ∈ I[c_i] and i_n ∈ I[c_j] 
are equivalent, then c_i and c_j refer to the same concept, or c_i is an equivalent or subclass 
concept of c_j, or c_j is a subclass concept of c_i; c_i and c_j share an identification datatype 
property dp_k such that i_m:dp_k = i_n:dp_k or an identification object property op_k such that 
i_m:op_k = i_n:op_k (or i_m:op_k ∈ eq[i_n:op_k]). 
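
Matching on an identification datatype property can be sketched as a grouping step, as in the Python fragment below. It is a simplification under stated assumptions: ‘ID’ is assumed to be declared as an identification property, the concept compatibility condition and identification object properties are ignored, and every URI other than Amy's is invented for the example.

```python
from collections import defaultdict

AMY = "http://www.rutgers.edu/~amy#amy"

# Hypothetical descriptions: URI -> instantiated datatype properties.
INSTANCES = {
    AMY: {"ID": "123"},
    "http://example.org/grad#amy": {"ID": "123"},
    "http://example.org/staff#bob": {"ID": "456"},
    "http://example.org/visitor#carl": {},
}


def equivalence_classes(instances, id_property):
    """Group URIs that share the same value of the identification property."""
    groups = defaultdict(set)
    for uri, props in instances.items():
        if id_property in props:
            groups[props[id_property]].add(uri)
    return {uri: members for members in groups.values() for uri in members}


def eq(instances, id_property, uri):
    """eq[i]: the instances equivalent to uri, with uri itself included."""
    return equivalence_classes(instances, id_property).get(uri, {uri})


print(sorted(eq(INSTANCES, "ID", AMY)))
```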

Direct association between instances: For instances i_i and i_j where i_i ∈ I'[c_i], i_j ∈ I'[c_j], 
if there exists op_i ∈ OP'[c_i] such that op_i[c_i] = c_k, c_k ∈ C[i_j] and i_i:op_i = i_j, then i_i and i_j are 
directly associated instances through op_i. For instance i, we use da[i] for the set of 
instances directly associated with i, and da[i:op] represents the set of instances directly 
associated with i through op. Take an instance from Fig. 2 as an example, 
da[‘http://www.rutgers.edu/~amy#amy’:registers] = ‘http://cs.rutgers.edu/course#320’. 
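
If object property instantiations are stored as (instance, property, value) triples, da[i] and da[i:op] reduce to a filter over them. The Python sketch below treats direct association as symmetric, which is an assumption of the sketch; `TRIPLES` and `da` are hypothetical names.

```python
AMY = "http://www.rutgers.edu/~amy#amy"
COURSE = "http://cs.rutgers.edu/course#320"

# Object property instantiations i:op = i_j stored as (i, op, i_j).
TRIPLES = [(AMY, "registers", COURSE)]


def da(uri, op=None):
    """Instances directly associated with uri, optionally through one object property."""
    associated = set()
    for subject, prop, value in TRIPLES:
        if op is not None and prop != op:
            continue
        if subject == uri:
            associated.add(value)
        elif value == uri:
            associated.add(subject)  # association read from the other end
    return associated


print(da(AMY, "registers"))
```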

4 Changes to the Semantic Web Data and Profiles 

Detecting changes to the Semantic Web data requires one to first identify whether a
change has occurred to the instantiation of concepts along with their properties. If a
change has occurred, then the change detection needs to identify the actual changes to
the instances and their properties. A change to the Semantic Web data is denoted
by $\Delta$. Three types of changes are significant to our approach: addition of property
instantiation (denoted by $\Delta_A$), deletion of property instantiation ($\Delta_D$) and
update to property instantiation value ($\Delta_U$). We define
$\Delta_A$ ($\Delta_D$) $= (i, c, p, v)$, where $i$ is the instance involved in the
addition (deletion), $c$ the concept to which $i$ belongs, i.e. $i \in I[c]$, $p$ the
property added or deleted in the change, and $v$ the value of $p$ involved in the
addition (deletion); $\Delta_U = (i, c, p, ov, nv)$, where $i$ is the instance being
updated, $c$ the concept to which $i$ belongs, i.e. $i \in I[c]$, $p$ the property being
updated, $ov$ the old value of $p$ before the update, and $nv$ the new value of $p$
after the update.

Given a specific change $\Delta$, we use $i[\Delta]$, $c[\Delta]$, $p[\Delta]$,
$dp[\Delta]$ and $op[\Delta]$ to denote the instance, the concept to which the instance
belongs, the property, the datatype property and the object property involved in the
change, respectively. We use $v[\Delta_A]$ to denote the value of $p[\Delta_A]$ added in
the change, $v[\Delta_D]$ to denote the value of $p[\Delta_D]$ deleted by the change,
$ov[\Delta_U]$ to denote the old value of $p[\Delta_U]$ before the change and
$nv[\Delta_U]$ to denote the new value of $p[\Delta_U]$ after the change.

Based on the changes to the Semantic Web data, the reasoning engine fires the ap- 
propriate change inference rules and profile inference rules to generate a profile for 
the target instances. We describe the components of a profile below. 

Profile: A profile generated by the reasoning engine consists of: Target instances
(TI): instances identified explicitly by their URIs or instances whose properties take
specific values; Target concepts (TC): the set of concept(s) that target instances
belong to; Target properties (TP): the set of properties that target instances may or
may not instantiate, based on the type of the change. Specifically, TP are instantiated
if the change is $\Delta_U$ or $\Delta_D$, but they are not instantiated if the change is
$\Delta_A$. We use EXISTS and NOT EXISTS to indicate whether or not TP are instantiated.
The profile predicted for the target instances describes their common characteristics in
terms of concepts, instances as well as their properties, and is further used to locate
these instances.
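The three change types and the EXISTS / NOT EXISTS marker can be pictured with a small data-structure sketch. This is only an illustration under assumptions: the class names, the string-valued fields and the helper `tp_marker` are hypothetical and do not come from the paper, which has not yet been implemented.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Addition:
    """Delta_A = (i, c, p, v): value v of property p added to instance i of concept c."""
    i: str
    c: str
    p: str
    v: str


@dataclass(frozen=True)
class Deletion:
    """Delta_D = (i, c, p, v): value v of property p deleted from instance i of concept c."""
    i: str
    c: str
    p: str
    v: str


@dataclass(frozen=True)
class Update:
    """Delta_U = (i, c, p, ov, nv): property p of instance i changed from ov to nv."""
    i: str
    c: str
    p: str
    ov: str
    nv: str


Change = Union[Addition, Deletion, Update]


def tp_marker(change: Change) -> str:
    """Target properties are instantiated (EXISTS) for deletions and updates,
    and not instantiated (NOT EXISTS) for additions."""
    return "NOT EXISTS" if isinstance(change, Addition) else "EXISTS"
```

For example, an `Update` on a hypothetical `registers` property yields the marker `EXISTS`, while an `Addition` on the same property yields `NOT EXISTS`.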

5 Inference Rules 

Based on some detected changes and ontologies, the reasoning engine makes smart 
decisions about what instances should have changed and describes them in terms of 
target concepts, instances and properties. The reasoning process relies on different 
types of inference rules triggered in a specific sequence: it first fires the change infer- 
ence rules based on the changes to seed instances and ontologies, then the profile 
inference rules to generate the profile, in which concept inference rules, property 
inference rules and instance inference rules may be triggered. 

The components of an ontology described in section 3 are based on what is explicitly
stated in the ontology. The significance of ontologies goes beyond their components by
allowing for reasoning, entailed particularly by domain-independent object properties
and axioms of properties. Domain-independent object properties such as subClassOf and
equivalentClass have their own axioms such as transitivity. For a concept $c_i$, we use
$sub[c_i]$ to represent the set of concepts which are subclasses of $c_i$ and $eq[c_i]$
to represent the set of concepts which are equivalent to $c_i$, either by definition or
by inference. These operators will be used in the profile inference rules to represent
the target concepts for a profile. The concept inference rules specify how the set of
equivalent and subclass concepts can be generated for a given concept based on the
ontology.

Concept Inference Rules: We identify the following inference rules for identifying
$sub[c]$ and $eq[c]$ for a given concept $c$. Let $c_i$, $c_j$ and $c_k$ be three concepts.

(1) $(equivalentClass[c_k] = c_i \vee equivalentClass[c_k] = c_j \text{ where } c_j \in eq[c_i]) \Rightarrow (c_k \in eq[c_i])$.
This means, if $c_k$ is asserted as equivalentClass of $c_i$ or equivalentClass of $c_j$
such that $c_j$ belongs to the set of concepts equivalent to $c_i$, then $c_k$ belongs to
the set of concepts equivalent to $c_i$.

(2) $(c_k \in eq[c_i]) \Rightarrow (c_i \in eq[c_k])$. If $c_k$ belongs to the set of
concepts equivalent to $c_i$, then $c_i$ also belongs to the set of concepts equivalent
to $c_k$.

(3) $(subClassOf[c_k] = c_i \vee subClassOf[c_k] = c_j \text{ where } c_j \in sub[c_i]) \Rightarrow (c_k \in sub[c_i])$;
and $(c_i \in intersectionOf[c_k] \vee c_k \in unionOf[c_i]) \Rightarrow (subClassOf[c_k] = c_i)$.
This means, if $c_k$ is asserted as subClassOf $c_i$ or subClassOf $c_j$ such that $c_j$
belongs to the set of subclass concepts of $c_i$, then $c_k$ belongs to the set of
subclass concepts of $c_i$. Moreover, that $c_k$ is a subclass of $c_i$ can also be
implied if $c_k$ is the intersection of $c_i$ and some other concepts, or if $c_i$ is
the union of $c_k$ and some other concepts.

(4) $(c_k \in eq[c_j] \wedge c_k \in sub[c_i]) \Rightarrow (c_j \in sub[c_i])$. This
means, if $c_k$ belongs to the set of concepts equivalent to $c_j$ and the set of
subclass concepts of $c_i$, then $c_j$ also belongs to the set of subclass concepts of
$c_i$.

(5) $(c_k \in eq[c_j] \wedge c_i \in sub[c_k]) \Rightarrow (c_i \in sub[c_j])$. This
means, if $c_i$ belongs to the set of subclass concepts of $c_k$ and $c_k$ belongs to the
set of equivalent concepts of $c_j$, then $c_i$ also belongs to the set of subclass
concepts of $c_j$.
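The concept inference rules amount to a closure computation over the asserted equivalentClass and subClassOf statements. The following is a minimal sketch of that idea, assuming a naive fixpoint iteration; the function name, the pair encoding and the concept name `Pupil` in the usage note are illustrative assumptions, and the intersectionOf/unionOf half of rule (3) is deliberately left out.

```python
from collections import defaultdict


def concept_closure(equivalent_pairs, subclass_pairs):
    """Saturate eq[.] and sub[.] using rules (1), (2), (4), (5) and the first half of (3).

    A pair (a, b) in equivalent_pairs asserts equivalentClass[a] = b;
    a pair (a, b) in subclass_pairs asserts subClassOf[a] = b.
    """
    eq = set(equivalent_pairs)   # (k, i) encodes "c_k is in eq[c_i]"
    sub = set(subclass_pairs)    # (k, i) encodes "c_k is in sub[c_i]"
    while True:
        new_eq = set(eq)
        new_sub = set(sub)
        # Rule (2): equivalence is symmetric.
        new_eq |= {(i, k) for (k, i) in eq}
        # Rule (1): equivalence propagates through an equivalent concept.
        new_eq |= {(k, i) for (k, j) in eq for (j2, i) in eq if j == j2 and k != i}
        # Rule (3), first half: subclass membership is transitive.
        new_sub |= {(k, i) for (k, j) in sub for (j2, i) in sub if j == j2}
        # Rule (4): an equivalent of a subclass of c_i is a subclass of c_i.
        new_sub |= {(j, i) for (k, j) in eq for (k2, i) in sub if k == k2}
        # Rule (5): a subclass of c_k is a subclass of every equivalent of c_k.
        new_sub |= {(i, j) for (k, j) in eq for (i, k2) in sub if k == k2}
        if new_eq == eq and new_sub == sub:
            break
        eq, sub = new_eq, new_sub

    def as_map(pairs):
        out = defaultdict(set)
        for member, key in pairs:
            out[key].add(member)
        return out

    return as_map(eq), as_map(sub)
```

With `Pupil` asserted equivalent to `Student`, `GraduateStudent` a subclass of `Student` and `PhDStudent` a subclass of `GraduateStudent`, the sketch places both `GraduateStudent` and `PhDStudent` in `sub["Student"]` and, by rule (5), in `sub["Pupil"]` as well.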

Similar rules for $sup[p]$ and $eq[p]$ can be built based on subPropertyOf and
equivalentProperty. The property inference rules below help us identify the
target properties from the set of properties where initial changes are detected. For a
property $p$, we use $sup[p]$ to represent the super-properties of $p$, i.e. the
properties of which $p$ is a sub-property, and $eq[p]$ to represent the properties
equivalent to $p$.

Property Inference Rules: We identify the following inference rules for identifying
$sup[p]$ and $eq[p]$ for a given property $p$. Let $p_i$, $p_j$ and $p_k$ be three
properties.

(1) $(equivalentProperty[p_k] = p_i \vee equivalentProperty[p_k] = p_j \text{ where } p_j \in eq[p_i]) \Rightarrow (p_k \in eq[p_i])$.
This means, if $p_k$ is asserted as equivalentProperty of $p_i$ or that of $p_j$ such
that $p_j$ belongs to the set of properties equivalent to $p_i$, then $p_k$ belongs to
the set of properties equivalent to $p_i$.

(2) $(p_k \in eq[p_i]) \Rightarrow (p_i \in eq[p_k])$. This means, if $p_k$ belongs to
the set of properties equivalent to $p_i$, then $p_i$ also belongs to the set of
properties equivalent to $p_k$.

(3) $(subPropertyOf[p_k] = p_i \vee subPropertyOf[p_k] = p_j \text{ where } p_i \in sup[p_j]) \Rightarrow (p_i \in sup[p_k])$.
This means, if $p_k$ is asserted as subPropertyOf $p_i$ or that of $p_j$ such that $p_i$
belongs to the set of super-properties of $p_j$, then $p_i$ belongs to the set of
super-properties of $p_k$.

(4) $(p_k \in eq[p_j] \wedge p_k \in sup[p_i]) \Rightarrow (p_j \in sup[p_i])$. This
means, if $p_k$ belongs to the set of properties equivalent to $p_j$ and the set of
super-properties of $p_i$, then $p_j$ also belongs to the set of super-properties of
$p_i$.

(5) $(p_k \in eq[p_j] \wedge p_i \in sup[p_k]) \Rightarrow (p_i \in sup[p_j])$. This
means, if $p_k$ belongs to the set of properties equivalent to $p_j$ and $p_i$ belongs to
the set of super-properties of $p_k$, then $p_i$ also belongs to the set of
super-properties of $p_j$.

Instance inference rules used by the reasoning engine can identify eq[i] and da[i] 
for a given instance i. The complexity of identifying equivalent instances results from 
the distributed nature of the web, where owners of information sources are allowed to 
instantiate equivalent instances by resorting to even different concepts and properties. 
The target instances in the profile generated by the profile inference rules may call the 
rules defined below to get the instances. 

Equivalent Instance Inference Rule: We identify the following inference rule for
identifying $eq[i]$ for a given instance $i$. Let $i_i$ be an instance. Given
$i_i \in I[c_i]$,
$eq[i_i] = \{\, i \mid (i \in I[c_j] \text{ such that } c_j = c_i,\ c_j \in eq[c_i],\ c_j \in sub[c_i] \text{ or } c_i \in sub[c_j]) \wedge (i:p_k = i_i:p_k \vee i:p_k \in eq[i_i:p_k] \text{ such that } InverseFunctionalProperty \in A[p_k]) \,\}$.
This rule states that, given $i_i$ is an instance of concept $c_i$, the equivalent
instances of $i_i$ will be those that are instances of $c_i$ or an equivalent concept of
$c_i$ or a subclass concept of $c_i$ or a concept of which $c_i$ is a subclass concept,
and that take the same value for their identification property as that of $i_i$.
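A simplified reading of the equivalent instance rule is sketched below. It is an assumption-laden illustration rather than the authors' implementation: instances are held in a plain dictionary, the set of related concepts (the concept itself, its equivalents, its subclasses and its superclasses) is passed in precomputed, and the branch in which an object-property value is matched up to equivalence is omitted. All names are hypothetical.

```python
def equivalent_instances(target_uri, instances, related_concepts, identification_props):
    """Return the set eq[i] for the instance named by target_uri.

    instances: dict mapping URI -> (concept, {property: value}).
    related_concepts: concepts that are c_i itself, equivalent to c_i,
        subclasses of c_i, or superclasses of c_i (assumed precomputed).
    identification_props: properties declared InverseFunctionalProperty.
    """
    _, target_values = instances[target_uri]
    result = {target_uri}  # eq[i] includes i itself
    for uri, (concept, values) in instances.items():
        if concept not in related_concepts:
            continue
        # Two instances match if they agree on some identification property.
        for prop in identification_props:
            if prop in values and prop in target_values and values[prop] == target_values[prop]:
                result.add(uri)
                break
    return result
```

Under this sketch, two descriptions of the same student published by different sources are grouped together as soon as they share a value for an identification property such as a hypothetical `ssn`, even if one source types the instance as `Student` and the other as `GraduateStudent`.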

Directly Associated Instance Inference Rules: We identify the following inference
rules for identifying $da[i]$ for a given instance $i$, where $A[p]$ denotes the set of
axioms over property $p$.

(1) Given $i_i \in I'[c_i]$, if $op_i[c_k] = c_j$ (or $op_j[c_j] = c_k$) where
$c_k \in C[i_i]$ and $SymmetricProperty \notin A[op_i]$, then
$da[i_i:op_i] = \{\, i \mid i \in I'[c_j] \wedge (i_m:op_i = i_j \text{ where } i_m \in eq[i_i] \text{ and } i_j \in eq[i]) \text{ (or } i_j:op_j = i_m \text{ where } i_m \in eq[i_i] \text{ and } i_j \in eq[i]) \,\}$.
This rule states that, given $i_i$ is an instance (by assertion or inference) of concept
$c_i$, $c_i$ is directly associated with concept $c_j$ through object property $op_i$ and
$op_i$ is not a symmetric property, then the directly associated instances of $i_i$
through $op_i$ belong to concept $c_j$ (by assertion or inference), and these instances
or their equivalent instances are taken as the value of object property $op_i$ of $i_i$
or its equivalent instances. It also implies that, if $c_j$ is directly associated with
$c_i$ through object property $op_j$, then the directly associated instances of $i_i$
through $op_j$ belong to concept $c_j$ (by assertion or inference), and these instances
or their equivalent instances take $i_i$ or the equivalent instances of $i_i$ as the
value of object property $op_j$.

(2) Given $i_i \in I'[c_i]$, if $op_i[c_k] = c_j$ (or $op_i[c_j] = c_k$) where
$c_k \in C[i_i]$ and $SymmetricProperty \in A[op_i]$, then
$da[i_i:op_i] = \{\, i \mid i \in I'[c_j] \wedge (i_m:op_i = i_j \vee i_j:op_i = i_m \text{ where } i_m \in eq[i_i] \text{ and } i_j \in eq[i]) \,\}$.
This states that, given $i_i$ is an instance of concept $c_i$, $c_i$ is directly
associated with concept $c_j$ through object property $op_i$ (or $c_j$ is directly
associated with $c_i$ through object property $op_i$) and $op_i$ is a symmetric property,
then the directly associated instances of $i_i$ through $op_i$ belong to concept $c_j$,
and these instances or their equivalent instances either are taken as the value of
object property $op_i$ of $i_i$ or its equivalent instances, or take $i_i$ or the
equivalent instances of $i_i$ as the value of object property $op_i$. In the case that
$c_i$ and $c_j$ are directly associated through a symmetric property, $c_i = c_j$ or
$c_i \in eq[c_j]$.

(3) Given $i_i \in I'[c_i]$, if $op_i[c_k] = c_j$ where $c_k \in C[i_i]$ and
$inverseOf[op_i] = op_j$ (or $inverseOf[op_j] = op_i$), then
$da[i_i:op_i,op_j] = \{\, i \mid i \in I'[c_j] \wedge (i_m:op_i = i_j \vee i_j:op_j = i_m \text{ where } i_m \in eq[i_i] \text{ and } i_j \in eq[i]) \,\}$.
This states that, given $i_i$ is an instance of concept $c_i$, $c_i$ is directly
associated with concept $c_j$ through object property $op_i$ and property $op_i$ is
inverseOf $op_j$, then the directly associated instances of $i_i$ through $op_i$ and
$op_j$ belong to concept $c_j$, and these instances or their equivalent instances either
are taken as the value of object property $op_i$ of $i_i$ or its equivalent instances,
or take $i_i$ or its equivalent instances as the value of object property $op_j$.

Our previous inference rules are defined through reasoning based on the ontologies 
whereas the change inference rules below are based on intuition. 

Change Inference Rules: (1) $\forall \Delta_i$, if $i_j \in eq[i[\Delta_i]]$, then there
exists $\Delta_j$ where $i[\Delta_j] = i_j$. (2) $\forall \Delta_i$ with
$i[\Delta_i]:op[\Delta_i] = i_j$, there exists $\Delta_j$ where $i[\Delta_j] = i_j$ or
$i[\Delta_j] = i_k$ where $i_k \in eq[i_j]$; if
$SymmetricProperty \in A[op[\Delta_i]]$, then $op[\Delta_j] = op[\Delta_i]$, else
$inverseOf[op[\Delta_j]] = op[\Delta_i]$. (3) For $p_i \in P[c_i]$, $p_j \in P[c_j]$ and
$op[c_i] = c_j$, if a dependency exists from $p_j$ to $p_i$, which we denote as
$valueDependentOn[p_j] = p_i$, then $\forall \Delta_i$ where $p[\Delta_i] = p_i$ and
$i[\Delta_i] = i_i$, there exists $\Delta_j$ where $p[\Delta_j] = p_j$ and
$i[\Delta_j] \in da[i_i:op]$.

The first rule states that, if an instance has changed and it has equivalent instances, 
it implies that these equivalent instances also have changed. The second rule states 
that, if a change involves an object property that has an inverse property defined in 
the ontology, or this object property itself is symmetric, then it implies that the di- 
rectly associated instances as well as their equivalent ones have also changed. The 
third rule states that, if the value to a property has changed and dependency exists 
from another property of the directly associated concept to this property, it implies the 
value to the other property involved in the dependency should also have changed. Our 
justification for these three rules is that there exists a high probability that the seman- 
tically related instances will change consistently with the detected changes, as a result 
of the efforts by the same or even different information sources to maintain freshness 
and consistency. Though the second and the third rules both involve directly associ- 
ated instances, they are significantly different, in that while the third rule involves all 
the directly associated instances through the object property between the concepts 
involved, the second rule involves only some of the directly associated instances 
when the cardinality of the relationship is not one-to-one. 

To our knowledge, the dependency we proposed in the third rule is supported by none
of the ontology languages proposed so far. Though these dependencies can be obtained
after ontology construction through analyzing the ontology or learning from the
change history of the Semantic Web data, our proposal suggests they be specified as
an add-on to the current proposal for ontology languages so that dependencies can be
specified by domain experts during ontology construction. This dependency add-on
can reveal the relationships between the concepts and their properties from a
completely new perspective, and can serve other purposes such as helping maintain data
instantiation.

6 Ontology-Guided Change Detection 

Profile Inference Rules: Based on the first change inference rule, our focus here is to
profile the equivalent instances of a given instance to which a change has been
detected. Changes are expected to the corresponding property of the equivalent
instances. Assume a change $\Delta_i$ has been detected to an instance $i_i$, where
$i[\Delta_i] = i_i$ and $p[\Delta_i] = p_i$, where $p_i$ is some property of $i_i$. The
reasoning engine needs to find out for which concept $c_i$ the property is defined in
the ontology, where $p[\Delta_i] \in P[c_i]$. Note that $c_i$ is not necessarily equal to
$c[\Delta_i]$: $c[\Delta_i]$ could be any other concept in $sub[c_i]$ or $eq[c_i]$, since
$p[\Delta_i]$ can be instantiated by any instance of these concepts. Identifying the
target instances (TI) will call the instance inference rules for $eq[i[\Delta_i]]$. As
indicated in the instance inference rules, there should exist
$p_l \in P'[c[\Delta_i]]$, where $p_l$ is either designated as an identification
property of $c_i$ or is defined to be an InverseFunctionalProperty. For any
$i_j \in eq[i_i]$, $i_j:p_l = i_i:p_l$. Then the reasoning engine needs to find out
whether $p_i$ is the sub-property of some other property $p_j$ where
$subPropertyOf[p_i] = p_j$. The equivalent instances of $i[\Delta_i]$ may have changes to
$p_i$ or $p_j$. Note that a change to a property always triggers a change to its
super-property, but may not necessarily change its sub-properties. The reasoning
engine also checks whether $p_i$ has any equivalent properties.

Profile for Equivalent Instances: Given a change $\Delta$, the profile for the
equivalent instances is:

• $TI = \{eq[i[\Delta]]\}$;

• $TC = \{c_i, sub[c_i], eq[c_i]\}$ where $p[\Delta] \in P[c_i]$;

• $TP = \{p[\Delta], eq[p[\Delta]], sup[p[\Delta]]\}$ EXISTS for $\Delta_D$ and $\Delta_U$, NOT EXISTS for $\Delta_A$.
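The profile for equivalent instances can be assembled mechanically once the eq, sub and sup lookups are available. The sketch below assumes those lookups are plain dictionaries of sets (for instance, produced by a closure step); the `Profile` class, the function name and the single-letter change kinds `"A"`, `"D"`, `"U"` are hypothetical conveniences, not notation from the paper.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    ti: set          # target instances
    tc: set          # target concepts
    tp: set          # target properties
    tp_exists: bool  # True for EXISTS, False for NOT EXISTS


def equivalent_instance_profile(change_kind, i, p, c_i,
                                eq_inst, sub_con, eq_con, eq_prop, sup_prop):
    """Build the profile for the equivalent instances of a changed instance.

    change_kind: "A" (addition), "D" (deletion) or "U" (update).
    i, p: the instance and property involved in the change.
    c_i: the concept on which p is defined, i.e. p is in P[c_i].
    The remaining arguments map a key to a set, e.g. eq_inst[i] is eq[i].
    """
    return Profile(
        ti={i} | eq_inst.get(i, set()),
        tc={c_i} | sub_con.get(c_i, set()) | eq_con.get(c_i, set()),
        tp={p} | eq_prop.get(p, set()) | sup_prop.get(p, set()),
        tp_exists=change_kind in ("D", "U"),
    )
```

The other three profiles differ mainly in which instance set is placed in TI and in which property (the changed one, its inverse, or the dependent one) seeds TP.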

For any change involving an object property, at least two instances directly related
by the property are involved: the instance to whose object property a change is
detected and the instance taken as the value of this object property. If the inverse
relationship of $p[\Delta]$ is defined in the ontology, which means there exists another
object property $op_j$ where $inverseOf[op_j] = op[\Delta]$ (or
$inverseOf[op[\Delta]] = op_j$), or property $p[\Delta]$ is symmetric, indicated by
$SymmetricProperty \in A[p[\Delta]]$, then the change may also be reflected in the
instance which is taken as the value of the property $p[\Delta]$. The target instances
may associate themselves with $i[\Delta]$ through the inverse property $op_j$ or the
symmetric property $p[\Delta]$ itself. The profiles for the directly associated
instances by inverse or symmetry follow.

Profile for Directly Associated Instances by Inverse: Given a change $\Delta$ involving
$op[\Delta]$ with $inverseOf[op[\Delta]]$ defined in the ontology, the profile for the
directly associated instances by inverse is:

• For $\Delta_A$: $TI = \{v[\Delta_A], eq[v[\Delta_A]]\}$;

  For $\Delta_D$:
    if the cardinality of $op[\Delta_D]$ is one-to-one: $TI = \{da[i[\Delta_D]:op[\Delta_D], inverseOf[op[\Delta_D]]]\}$;
    else: $TI = \{v[\Delta_D], eq[v[\Delta_D]]\}$.

  For $\Delta_U$:
    if the cardinality of $op[\Delta_U]$ is one-to-one: $TI = \{da[i[\Delta_U]:op[\Delta_U], inverseOf[op[\Delta_U]]], ov[\Delta_U], eq[ov[\Delta_U]], nv[\Delta_U], eq[nv[\Delta_U]]\}$;
    else: $TI = \{ov[\Delta_U], nv[\Delta_U], eq[ov[\Delta_U]], eq[nv[\Delta_U]]\}$.

• $TC = \{c_j, sub[c_j], eq[c_j]\}$ if $inverseOf[op[\Delta]] \in P[c_j]$;

• $TP = \{inverseOf[op[\Delta]], eq[inverseOf[op[\Delta]]], sup[inverseOf[op[\Delta]]]\}$ EXISTS for $\Delta_D$ and $\Delta_U$, NOT EXISTS for $\Delta_A$.

Profile for Directly Associated Instances by Symmetry: Given a change $\Delta$
involving a symmetric property $op[\Delta]$, the profile for the directly associated
instances by symmetry is:

• For $\Delta_A$: $TI = \{v[\Delta_A], eq[v[\Delta_A]]\}$;

  For $\Delta_D$:
    if the cardinality of $op[\Delta_D]$ is one-to-one: $TI = \{da[i[\Delta_D]:op[\Delta_D]]\}$;
    else: $TI = \{v[\Delta_D], eq[v[\Delta_D]]\}$.

  For $\Delta_U$:
    if the cardinality of $op[\Delta_U]$ is one-to-one: $TI = \{da[i[\Delta_U]:op[\Delta_U]], ov[\Delta_U], eq[ov[\Delta_U]], nv[\Delta_U], eq[nv[\Delta_U]]\}$;
    else: $TI = \{ov[\Delta_U], nv[\Delta_U], eq[ov[\Delta_U]], eq[nv[\Delta_U]]\}$.

• $TC = \{c_j, sub[c_j], eq[c_j]\}$ where $op[\Delta] \in P[c_j]$;

• $TP = \{op[\Delta], eq[op[\Delta]], sup[op[\Delta]]\}$ EXISTS for $\Delta_D$ and $\Delta_U$, NOT EXISTS for $\Delta_A$.

As we have discussed, we propose an add-on to ontology languages by including 
the dependency among concepts or their properties into ontologies. Note that these 
dependencies will be specified to the most general concepts applicable. 

Profile for Directly Associated Instances by Dependency: Given a change $\Delta$ and
$valueDependentOn[p_j] = p[\Delta]$ where $p_j \in P[c_j]$, the profile for directly
associated instances by dependency is:

• $TI = \{da[i[\Delta]:op_i]\}$, if $op_i[c[\Delta]] = c_j$ or $op_i[c_j] = c[\Delta]$;

  $TI = \{da[i[\Delta]:op_i,op_j]\}$, if $inverseOf[op_i] = op_j$ or $inverseOf[op_j] = op_i$.

• $TC = \{c_j, eq[c_j], sub[c_j]\}$;

• $TP = \{p_j, eq[p_j], sup[p_j]\}$ EXISTS for $\Delta_D$ and $\Delta_U$, NOT EXISTS for $\Delta_A$.

Intelligent Change Detection: The following algorithm describes how the intelligent 
change detection is done: 

[1] while the URL list is not empty
[2]   The web crawler retrieves a URL from the front of the list
[3]   The web crawler downloads the web page
[4]   The indexer indexes the web page
[5]   The change detector diffs the web page against its previous version from the
      repository
[6]   The change detector saves the web page and the changes to the repository, and
      sends changes to the reasoning engine
[7]   The reasoning engine determines the profile for the target web pages to be visited
[8]   The page locator finds out the URLs based on the profile by querying the index
[9]   The page locator appends the URLs to the URL list

The web crawler starts downloading web pages by retrieving a URL from a given 
list and the full-text of the page is indexed. A diff algorithm is run to find out whether 
and what changes have occurred to this page by comparing it with its earlier version 
retrieved from the local repository. The detected changes are saved to the repository 
along with the new version of the page. Meanwhile, the detected changes are sent to 
the reasoning engine, which uses the inference rules we defined to generate a profile
for the target instances to be visited next. The profile is used to locate the URLs of the 
instances satisfying the profile and these URLs are appended to the front of the URL 
list so that the crawler can visit them next. As can be seen, step 7 of this algorithm is 
the key to our approach. After the reasoning engine derives the profile for the in- 
stances to visit next, URLs for these target instances satisfying the profile have to be 
determined. More specifically, the profile is translated into queries, which are evalu- 
ated against the index maintained by the search engine or local repository, and URLs 
for the target instances are returned. 
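The control flow of the algorithm can be sketched as a loop over a double-ended queue. This is a minimal sketch under assumptions: every component (downloader, indexer, diff, reasoning engine, page locator) is passed in as a hypothetical callable, the repository is a plain dictionary, and the `max_pages` bound and the check against already visited URLs are guards added here for termination rather than steps from the paper. Following the prose description, located URLs are placed at the front of the list.

```python
from collections import deque


def crawl(seed_urls, download, index, diff, repository, infer_profile, locate,
          max_pages=100):
    """Ontology-guided crawl loop; the bracketed numbers refer to the algorithm steps."""
    urls = deque(seed_urls)
    visited = []
    while urls and len(visited) < max_pages:           # [1]
        url = urls.popleft()                            # [2]
        page = download(url)                            # [3]
        index(url, page)                                # [4]
        changes = diff(repository.get(url), page)       # [5]
        repository[url] = page                          # [6] save the new version
        visited.append(url)
        if not changes:
            continue
        profile = infer_profile(changes)                # [7] reasoning engine
        targets = locate(profile)                       # [8] query the index
        for target in reversed(targets):                # [9] schedule targets next
            if target not in urls and target not in visited:
                urls.appendleft(target)
    return visited
```

Iterating over the located targets in reverse while pushing to the front keeps them in their original order at the head of the queue, so the pages predicted to have changed are visited before the remaining seeds.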



7 Conclusions and Future Work 

We have presented a semantics-based change detection approach guided by ontologies
for the Semantic Web data. Given changes to some instances, the reasoning engine
works by firing pre-defined rules applicable to the changes, referring to the ontologies
that instances point to and generating a profile for the target instances to be visited
next. Due to the limited amount of Semantic Web data available, we are not yet in a
position to experiment on the efficiency and scalability of our approach. We are
planning to implement our reasoning engine and test it over synthesized data and changes.

The target instances may be determined with more change information. For in- 
stance, a sample can be taken from the instances of a concept before reasoning is 
done. Or, certain instances of multiple related concepts can be visited and their 
changes may be propagated through the object properties to concepts of a wider 
range. As a result, the target instances may belong to some indirectly associated con- 
cepts. As part of our future work, we intend to study what role the semantic locality 
plays in change detection. 

Abstract. In heterogeneous data warehousing environments, autonomous
data sources are integrated into a materialised integrated database.
The schemas of the data sources and the integrated database may be ex- 
pressed in different modelling languages. It is possible for either the data 
source schemas or the warehouse schema to evolve. This evolution may 
include evolution of the schema, or evolution of the modelling language 
in which the schema is expressed, or both. In such scenarios, it is impor- 
tant for the integration framework to be evolvable, so that the previous 
integration effort can be reused as much as possible. This paper describes 
how the AutoMed heterogeneous data integration toolkit can be used to 
handle the problem of schema evolution in heterogeneous data warehous- 
ing environments. This problem has been addressed before for specific 
data models, but AutoMed has the ability to cater for multiple data 
models, and for changes to the data model. 



1 Introduction 

With the increasing use of the Internet in distributed applications, data ware- 
houses may integrate data from remote, heterogeneous, autonomous data sources. 
The heterogeneity of these data sources has two aspects, heterogeneous data 
expressed in different data models, called model heterogeneity [10], and hetero- 
geneous data within different data schemas expressed in the same data model, 
called schema heterogeneity [10,18]. The common approach to handling model 
heterogeneity is to use a single conceptual data model (CDM) for the data trans- 
formation/integration. Each data source has a wrapper for translating its schema 
and data into the CDM. The warehouse schema is derived from these CDM 
schemas by means of view definitions, and is expressed in the same modelling 
language as them. With this approach, since they are both high-level conceptual 
data models, semantic mismatches may occur between the CDM and a source 
data model, and there may be a loss of information between them. Moreover, 
if a data source schema changes, it is not straightforward to evolve the view 
definitions of the warehouse schema. 

Lakshmanan et al. [11] argue that a uniform framework for schema integration
and schema evolution is both desirable and possible, and this is our view
also. They define a higher-order logic language, SchemaSQL, which handles both
data integration and schema evolution in relational multi-database systems. In
contrast, our approach uses a simple set of schema transformation primitives,
augmented with a functional query language, both of which are uniformly
applicable to multiple data models. Other previous work on schema evolution, e.g.
[1-4], has also presented approaches in terms of just one data model.

AutoMed is a heterogeneous data transformation and integration system
which offers the capability to handle data integration across multiple data
models 1. In [7] we discussed how AutoMed metadata can be used to express the
schemas and the cleansing, transformation and integration processes in hetero- 
geneous data warehouse environments, supporting both schema heterogeneity 
and model heterogeneity. We discussed how this metadata can be used to pop- 
ulate and incrementally maintain the warehouse, and any data marts derived 
from it, and also to trace the lineage of data in the warehouse or the data marts. 
It is clearly advantageous to be able to reuse this kind of metadata if a schema 
evolves. In this paper we show how this can be achieved. 

Earlier work [16] has shown how the AutoMed framework readily supports 
schema evolution in virtual data integration scenarios. In this paper we address 
the problem of schema evolution in materialised data integration scenarios, in- 
cluding both evolution of a source schema and of the warehouse schema, and also 
the impact on any data marts derived from the warehouse. This scenario is more 
complex than with virtual data integration, since both schemas and materialised 
data may be affected by an evolution. 

The outline of the paper is as follows. Section 2 gives an overview of the 
AutoMed framework. Section 3 describes how AutoMed transformations can 
be used to express a schema evolution if either the schema changes, or the data 
model changes, or both. Section 4 describes the actions that are taken in order to 
evolve these transformations and the materialised data if the warehouse schema 
or a local schema evolves. Section 5 discusses the benefits of our approach and 
gives our concluding remarks. 

2 Overview of AutoMed 

AutoMed supports a low-level hypergraph-based data model (HDM). Higher-level
modelling languages are defined in terms of this HDM. For example, previous
work has shown how relational, ER, OO [15], XML [21], flat-file [5] and
multidimensional [7] data models can be so defined. An HDM schema consists of
a set of nodes, edges and constraints, and each modelling construct of a
higher-level modelling language is specified as some combination of HDM nodes, edges
and constraints. For any modelling language $\mathcal{M}$ specified in this way (via
the API of AutoMed’s Model Definitions Repository [5]), data source wrappers translate
data source schemas expressed in $\mathcal{M}$ into their AutoMed representation,
without loss of information. AutoMed also provides a set of primitive schema
transformations that can be applied to schema constructs expressed in $\mathcal{M}$.
In particular, for every construct of $\mathcal{M}$ there is an add and a delete
primitive transformation which add to/delete from a schema an instance of that
construct. For those constructs of $\mathcal{M}$ which have textual names, there is
also a rename primitive transformation.

1 See http://www.doc.ic.ac.uk/automed/
In AutoMed, schemas are incrementally transformed by applying to them a
sequence of primitive transformations $t_1, \ldots, t_r$. Each primitive transformation
adds, deletes or renames just one schema construct. Thus, intermediate schemas
may contain constructs of more than one modelling language.

Each add or delete transformation is accompanied by a query specifying the 
extent of the new or deleted construct in terms of the rest of the constructs in 
the schema. This query is expressed in a functional query language IQL (see Sec- 
tion 2.1). Also available are contract and extend transformations which behave 
in the same way as add and delete except that they indicate that their accompa- 
nying query may only partially construct the extent of the new/removed schema 
construct. Moreover, their query may just be the constant Void, indicating that 
the extent of the new/removed construct cannot be derived even partially, in 
which case the query can be omitted. 

We term a sequence of primitive transformations from one schema $S_1$ to
another schema $S_2$ a transformation pathway from $S_1$ to $S_2$, denoted
$S_1 \rightarrow S_2$. All source, intermediate, and integrated schemas, and the
pathways between them, are stored in AutoMed’s Schemas & Transformations Repository [5].

The queries present within transformations that add or delete schema constructs
mean that each primitive transformation $t$ has an automatically derivable
reverse transformation, $\bar{t}$. In particular, each add/extend transformation is
reversed by a delete/contract transformation with the same arguments, while each
rename transformation is reversed by swapping its two arguments. Thus, AutoMed
is a both-as-view (BAV) data integration system. As discussed in [17],
BAV subsumes the global-as-view (GAV) and local-as-view (LAV) approaches 
[13], since it is possible to extract a definition of each global schema construct as 
a view over source schema constructs, and it is also possible to extract definitions 
of source schema constructs as views over the global schema. We refer the reader 
to [9] for details of AutoMed’s GAV and LAV view generation algorithms. 
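The automatic reversibility of pathways can be illustrated with a small model. This is a sketch under stated assumptions, not the AutoMed API: the `Step` class and both functions are hypothetical, queries are opaque strings, and reversing a whole pathway by reversing each step in the opposite order is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# add/extend are reversed by delete/contract with the same arguments, and vice versa.
REVERSE_KIND = {"add": "delete", "delete": "add",
                "extend": "contract", "contract": "extend"}


@dataclass(frozen=True)
class Step:
    kind: str                  # "add", "delete", "extend", "contract" or "rename"
    construct: str             # the schema construct acted upon
    arg: Optional[str] = None  # query text, or the new name for a rename


def reverse_step(step: Step) -> Step:
    """Derive the reverse of a single primitive transformation."""
    if step.kind == "rename":
        # A rename is reversed by swapping its two arguments.
        return Step("rename", step.arg, step.construct)
    return Step(REVERSE_KIND[step.kind], step.construct, step.arg)


def reverse_pathway(pathway: List[Step]) -> List[Step]:
    """Reverse a pathway S1 -> S2 into a pathway S2 -> S1 (assumed: reversed order)."""
    return [reverse_step(s) for s in reversed(pathway)]
```

Because each step carries its query, reversing a pathway twice returns the original pathway, which is the property that lets view definitions be extracted in both the GAV and the LAV direction.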

Figure 1 illustrates the general integration scenario with AutoMed. Each data
source is described by a local schema $LS_i$. Each $LS_i$ is first conformed into a
schema $CS_i$ (which may or may not be expressed in the same modelling language
as $LS_i$) by means of a transformation pathway $T_i$. Not all of the information
within a local schema $LS_i$ need be transferred into the global schema and this is
asserted by means of contract transformation steps within $T_i$. Conversely, there
may be information within the global schema which is not semantically derivable
from $LS_i$, and this is asserted by the pathway from $CS_i$ to a ‘union-schema’ $US_i$
which consists of the necessary extend transformations 2.

2 If there are none, then this pathway is empty and $CS_i$ and $US_i$ are the same schema.

All the union schemas $US_1, \ldots, US_n$ are syntactically identical and this is
asserted by creating a sequence of id transformations between each pair $US_i$
and $US_{i+1}$, of the form id $US_i:c$ $US_{i+1}:c$ for each schema construct $c$. An
id transformation signifies the semantic equivalence of syntactically identical
constructs in different schemas. The transformation pathways containing these
id transformations are automatically generated by the AutoMed software. An
arbitrary one of the $US_i$ (US in Figure 1) can then be selected for further
transformation into the global schema $GS$ (by the pathway $T_u$ in Figure 1). The
extent of each construct $c$ in a union schema $US_i$ is equal to the bag-union of
the extent of $c$ in all union schemas $US_1, \ldots, US_n$. That is, id is interpreted as
bag union by AutoMed’s view generation functionality.
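Bag union keeps duplicates, which distinguishes it from set union when the same value is contributed by more than one source. The following sketch illustrates the semantics only; the function name and the use of Python's `collections.Counter` as a multiset are choices made for this illustration and are not part of AutoMed.

```python
from collections import Counter


def bag_union(extents):
    """Combine the extents of one construct across several schemas, keeping duplicates."""
    total = Counter()
    for extent in extents:
        total.update(extent)  # adds multiplicities instead of collapsing them
    return total
```

For instance, if two sources each contribute the value `"b"` for the same construct, the combined extent contains `"b"` twice.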

In a virtual data integration scenario, there is no materialised data associated 
with any of the schemas apart from the LSi. In a data warehousing scenario, 
as illustrated in Figure 1, we assume that CS1, ..., CSn are fully materialised
and consist of the detailed data of the warehouse. This detailed data is further
augmented with the necessary summary views by the transformations in the
pathway Tu and we assume that these summary views are materialised in the
database GD. It would also be possible to partially or fully materialise more of
the intermediate schemas in the network, or to not materialise CS1, ..., CSn and
to fully materialise GS instead. Our techniques in this paper easily generalise to 
these alternatives. 




[Figure 1: a layered diagram showing, from top to bottom, the global schema and
database; the union schemas; the conformed schemas and databases; and the local
schemas and databases.]

Fig. 1. Materialised Data Integration in AutoMed



For the purposes of this paper, we assume that all the LSi and LDi have
been extracted from the original data sources and the data in the LDi has
been cleansed. The data cleansing process can also be expressed using AutoMed 
transformations - this is discussed in [7] and we do not consider it further here. 
See also that paper for some examples of how AutoMed transformations can 
express structural and representational changes to schemas and data. 

We also assume here that there are no contract steps in the pathways Ti, i.e.
that all the information in each LSi will be transferred to CSi and hence to USi.
This implies no loss of flexibility as each LSi will be precisely that extract of the
original data source schema whose associated data is to be transferred into the
warehouse.



2.1 The IQL Query Language 

IQL is a comprehensions-based functional query language³. Such languages
subsume query languages such as SQL and OQL in expressiveness [6]. IQL supports
several primitive operators for manipulating lists. The list append operator, ++,
concatenates two lists together. The distinct operator removes duplicates from
a list and the sort operator sorts a list. The -- operator takes two lists and
subtracts each member of the second list from the first, e.g.
[1,2,3,2,4] -- [4,4,2,1] = [3,2]. The fold operator applies a given function f to
each element of a list and then ‘folds’ a binary operator op into the resulting
values. It is defined recursively as follows, where (x:xs) denotes a list with
head x and tail xs:

    fold f op e [] = e
    fold f op e (x:xs) = (f x) op (fold f op e xs)

Other IQL list manipulation operators are defined using fold together with
IQL’s set of built-in operators and its support of lambda abstractions. For
example, the IQL functions sum and count are equivalent to SQL’s SUM and
COUNT aggregation functions and are defined as

    sum xs = fold (id) (+) 0 xs
    count xs = fold (lambda x.1) (+) 0 xs

We also have

    min xs = fold (id) lesser maxNum xs
    max xs = fold (id) greater minNum xs

assuming constants maxNum and minNum and the following functions lesser and
greater:

    greater = lambda x. lambda y. if (x > y) then x else y
    lesser = lambda x. lambda y. if (x < y) then x else y

The function flatmap applies a list-valued function f to each member of a list
xs and is defined in terms of fold:

    flatmap f xs = fold f (++) [] xs

flatmap can in turn be used to define selection, projection and join operators 
and, more generally, comprehensions. For example, the following comprehen- 
sion iterates through a list of students and returns those students who are not 
members of staff: 

    [x | x <- <<student>>; not (member <<staff>> x)]

and it translates into:

    flatmap (lambda x. if (not (member <<staff>> x))
                       then [x] else []) <<student>>
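As an informal illustration, the definitions above can be mimicked in Python. This is only a sketch under our own assumptions: Python lists stand in for IQL lists, the names iql_sum and iql_count are hypothetical, and the student and staff data are invented for the example. It is not AutoMed's IQL implementation.

```python
def fold(f, op, e, xs):
    """fold f op e [] = e; fold f op e (x:xs) = (f x) op (fold f op e xs)."""
    result = e
    # Walk the list from the right so that the head is combined last,
    # exactly as in the recursive definition.
    for x in reversed(xs):
        result = op(f(x), result)
    return result


def iql_sum(xs):
    return fold(lambda x: x, lambda a, b: a + b, 0, xs)


def iql_count(xs):
    return fold(lambda x: 1, lambda a, b: a + b, 0, xs)


def flatmap(f, xs):
    # op is list append (++), e is the empty list.
    return fold(f, lambda a, b: a + b, [], xs)


# Invented extents for <<student>> and <<staff>>.
student = ["ann", "bob", "cat"]
staff = ["bob"]

# [x | x <- <<student>>; not (member <<staff>> x)]
non_staff = flatmap(lambda x: [x] if x not in staff else [], student)
```

Here non_staff is built by the same flatmap translation as the comprehension: each student maps to a singleton list or to the empty list, and the results are appended.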

Grouping operators are also definable in terms of fold. In particular, the opera- 
tor group takes as an argument a list of pairs xs and groups them on their first 
component, while gc aggFun xs groups a list of pairs xs on their first component 
and then applies the aggregation function aggFun to the second component. 

3 We refer the reader to [8] for details of IQL.

There are several algebraic properties of IQL’s operators that we can use
in order to incrementally compute materialised data and to reason about IQL
expressions, specifically for the purposes of this paper in a schema/data
evolution context (note that the algebraic properties of fold below apply to all
the operators defined in terms of fold):

(a) e ++ [] = [] ++ e = e, e -- [] = e, [] -- e = [],
distinct [] = sort [] = []

for any list-valued expression e. Since Void represents a construct for which
no data is obtainable from a data source, it has the semantics of the empty
list, and thus the above equivalences also hold if Void is substituted for [].

(b) fold f op e [] = fold f op e Void = e, for any f, op, e.

(c) fold f op e (b1 ++ b2) = (fold f op e b1) op (fold f op e b2)
for any f, op, e, b1, b2. Thus, we can always incrementally compute the
value of fold-based functions if collections expand.

(d) fold f op e (b1 -- b2) = (fold f op e b1) op’ (fold f op e b2)
provided there is an operator op’ which is the inverse of op, i.e. such that
(a op b) op’ b = a for all a, b. For example, if op = + then op’ = -
and thus we can always incrementally compute the value of aggregation
functions such as count, sum and avg if collections contract. Note that this
is not possible for min and max since lesser and greater have no inverses.
Although IQL is list-based, if the ordering of elements within lists is ignored
then its operators are faithful to the expected bag semantics, and within
AutoMed we generally do assume bag semantics. Under this assumption,
(xs ++ ys) -- ys = xs
for all xs, ys and thus we can incrementally compute the value of flatmap
and all its derivative operators if collections contract⁴.
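Properties (c) and (d) can be checked on a small example. The following Python sketch is ours and makes several assumptions: lists model bags, bag_minus is a hypothetical stand-in for the -- operator, and the mark values are invented. It maintains a sum incrementally when a collection expands and then contracts, instead of recomputing it from scratch.

```python
def fold(f, op, e, xs):
    """fold f op e [] = e; fold f op e (x:xs) = (f x) op (fold f op e xs)."""
    result = e
    for x in reversed(xs):
        result = op(f(x), result)
    return result


def bag_minus(xs, ys):
    """Sketch of IQL's -- : remove one occurrence from xs per member of ys."""
    out = list(xs)
    for y in ys:
        if y in out:
            out.remove(y)  # removes the first matching occurrence only
    return out


def ident(x):
    return x


def add(a, b):
    return a + b


def sub(a, b):
    # op' for op = +, since (a + b) - b = a
    return a - b


marks = [76, 78, 86]          # invented extent
total = fold(ident, add, 0, marks)

# Property (c): the collection expands by `inserted`.
inserted = [85, 90]
total_after_insert = add(total, fold(ident, add, 0, inserted))

# Property (d): the collection contracts by `deleted`.
deleted = [78]
total_after_delete = sub(total_after_insert, fold(ident, add, 0, deleted))
```

Only the inserted or deleted elements are folded in each step. The same pattern would not work for min or max, for the reason given in (d).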

2.2 An Example 

We will use schemas expressed in a simple relational data model and a simple 
XML data model to illustrate our techniques. However, we stress that these 
techniques are applicable to schemas defined in any data modelling language 
that has been specified within AutoMed’s Model Definitions Repository. 

In the simple relational model, there are two kinds of schema construct: Rel
and Att. The extent of a Rel construct <<R>> is the projection of the relation R
onto its primary key attributes k1, ..., kn. The extent of each Att construct <<R,a>>
where a is an attribute (key or non-key) of R is the projection of relation R onto
k1, ..., kn, a. For example, the schema of table MAtab in Figure 2 consists of a
Rel construct <<MAtab>>, and four Att constructs <<MAtab,Dept>>, <<MAtab,CID>>,
<<MAtab,SID>>, and <<MAtab,Mark>>. We refer the reader to [15] for an encoding
of a richer relational data model, including the modelling of constraints.

In the simple XML data model, there are three kinds of schema construct:
Element, Attribute and NestSet. The extent of an Element construct <<e>> consists
of all the elements with tag e in the XML document; the extent of each Attribute
construct <<e,a>> consists of all pairs of elements and attributes x, y such that
element x has tag e and has an attribute a with value y; and the extent of each
NestSet construct <<p,c>> consists of all pairs of elements x, y such that element x
has tag p and has a child element y with tag c. We refer the reader to [21] for an
encoding of a richer model for XML data sources, called XMLDSS, which also
captures the ordering of children elements under parent elements and cardinality
constraints. That paper gives an algorithm for generating the XMLDSS schema
of an XML document. That paper also discusses a unique naming scheme for
Element constructs so as to handle instances of the same element tag occurring
at multiple positions in the XMLDSS tree.

4 The distinct operator can also be used to obtain set semantics, if needed.
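To make the three kinds of extent concrete, the following Python sketch computes them with the standard xml.etree.ElementTree module. The function names and the small course/student document are our own assumptions for illustration. The sketch covers only the simple XML model described above, not XMLDSS.

```python
import xml.etree.ElementTree as ET

# A small invented document with course and student elements.
DOC = """<root>
  <course CID="ISC01" cname="Math">
    <student SID="ISS01" mark="76"/>
    <student SID="ISS02" mark="78"/>
  </course>
</root>"""

root = ET.fromstring(DOC)


def element_extent(root, tag):
    """Extent of <<e>>: all elements with tag e."""
    return list(root.iter(tag))


def attribute_extent(root, tag, attr):
    """Extent of <<e,a>>: pairs (element, value of attribute a)."""
    return [(el, el.get(attr)) for el in root.iter(tag) if attr in el.attrib]


def nestset_extent(root, parent_tag, child_tag):
    """Extent of <<p,c>>: pairs (parent, child) where child is a direct child."""
    return [(p, c)
            for p in root.iter(parent_tag)
            for c in p              # iterating an Element yields its children
            if c.tag == child_tag]


student_marks = [value for _, value in attribute_extent(root, "student", "mark")]
```

Note that attribute values are returned as strings; a real wrapper would presumably apply type information that this sketch ignores.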

Figure 2 illustrates the integration of three data sources LD1, LD2 and LD3,
which respectively store students’ marks for three departments MA, IS and CS.

    <root>
      <course CID="ISC01" cname="Math">
        <student SID="ISS01" mark="76"/>
        <student SID="ISS02" mark="78"/>
      </course>
      <course CID="ISC02" cname="Programming">
        <student SID="ISS01" mark="86"/>
        <student SID="ISS02" mark="85"/>
      </course>
    </root>

Fig. 2. An example integration



Database LD1 for department MA has one table of students’ marks for each
course, where the relation name is the course ID. Database LD2 for department
IS is an XML file containing information of course IDs, course names, student IDs
and students’ marks. Database LD3 for department CS has one table containing
one row per student, giving the student’s ID, name, and mark for the courses
CSC01, CSC02 and CSC03. CD1, CD2 and CD3 are the materialised conformed
databases for each data source. Finally, the global database GD contains one
table CourseSum(Dept, CID, Total, Avg) which gives the total and average mark
for each course of each department. Note that the virtual union schema US
(not shown) combines all the information from all the conformed schemas and
consists of a virtual table Details(Dept, CID, SID, CName, SName, Mark).

The following transformation pathways express the schema transformation 
and integration processes in this example. Due to space limitations, we have not 
given the remaining steps for deleting/contracting the constructs in the source 
schema of each pathway (note that this ‘growing’ and ‘shrinking’ of schemas is 
characteristic of AutoMed schema transformation pathways):

T1 : LS1 → CS1

addRel <<MAtab>> [{'MA','MAC01',x} | x <- <<MAC01>>] ++ [{'MA','MAC02',x} | x <- <<MAC02>>]
                 ++ [{'MA','MAC03',x} | x <- <<MAC03>>];
addAtt <<MAtab,Dept>> [{k1,k2,k3,k1} | {k1,k2,k3} <- <<MAtab>>];
addAtt <<MAtab,CID>> [{k1,k2,k3,k2} | {k1,k2,k3} <- <<MAtab>>];
addAtt <<MAtab,SID>> [{k1,k2,k3,k3} | {k1,k2,k3} <- <<MAtab>>];
addAtt <<MAtab,Mark>> [{'MA','MAC01',k,x} | {k,x} <- <<MAC01,Mark>>]
                 ++ [{'MA','MAC02',k,x} | {k,x} <- <<MAC02,Mark>>]
                 ++ [{'MA','MAC03',k,x} | {k,x} <- <<MAC03,Mark>>];
delAtt <<MAC01,Mark>> [{k3,x} | {k1,k2,k3,x} <- <<MAtab,Mark>>; k2='MAC01'];
delAtt <<MAC01,SID>> [{k3,x} | {k1,k2,k3,x} <- <<MAtab,SID>>; k2='MAC01'];
delRel <<MAC01>> [{k3} | {k1,k2,k3} <- <<MAtab>>; k2='MAC01']



The removal of the other two tables in LS1 is similar.



T2 : LS2 → CS2

addRel <<IStab>> [{'IS',x,y} | {c,x} <- <<course,CID>>; {s,y} <- <<student,SID>>];
addAtt <<IStab,Dept>> [{k1,k2,k3,k1} | {k1,k2,k3} <- <<IStab>>];
addAtt <<IStab,CID>> [{k1,k2,k3,k2} | {k1,k2,k3} <- <<IStab>>];
addAtt <<IStab,SID>> [{k1,k2,k3,k3} | {k1,k2,k3} <- <<IStab>>];
addAtt <<IStab,CName>> [{'IS',x,y,n} | {c1,x} <- <<course,CID>>; {c2,n} <- <<course,cname>>; c1=c2;
                 {c3,s1} <- <<course,student>>; c3=c2; {s2,y} <- <<student,SID>>; s2=s1];
addAtt <<IStab,Mark>> [{'IS',x,y,m} | {c1,x} <- <<course,CID>>; {c2,s1} <- <<course,student>>; c1=c2;
                 {s2,y} <- <<student,SID>>; s2=s1; {s3,m} <- <<student,mark>>; s3=s2];



T3 : LS3 → CS3

addRel <<CStab>> [{'CS',x,y} | x <- ['CSC01','CSC02','CSC03']; y <- <<CSMarks>>];
addAtt <<CStab,Dept>> [{k1,k2,k3,k1} | {k1,k2,k3} <- <<CStab>>];
addAtt <<CStab,CID>> [{k1,k2,k3,k2} | {k1,k2,k3} <- <<CStab>>];
addAtt <<CStab,SID>> [{k1,k2,k3,k3} | {k1,k2,k3} <- <<CStab>>];
addAtt <<CStab,SName>> [{'CS',x,k,s} | x <- ['CSC01','CSC02','CSC03']; {k,s} <- <<CSMarks,SName>>];
addAtt <<CStab,Mark>> [{'CS','CSC01',k,x} | {k,x} <- <<CSMarks,CSC01>>]
                 ++ [{'CS','CSC02',k,x} | {k,x} <- <<CSMarks,CSC02>>]
                 ++ [{'CS','CSC03',k,x} | {k,x} <- <<CSMarks,CSC03>>];



Tu : US → GS

addRel <<CourseSum>> distinct [{k1,k3} | {k1,k2,k3} <- <<Details>>];
addAtt <<CourseSum,Dept>> [{k1,k2,k1} | {k1,k2} <- <<CourseSum>>];
addAtt <<CourseSum,CID>> [{k1,k2,k2} | {k1,k2} <- <<CourseSum>>];
addAtt <<CourseSum,Total>> [{x,y,z} | {{x,y},z} <-
                 (gc sum [{{k1,k3},x} | {k1,k2,k3,x} <- <<Details,Mark>>])];
addAtt <<CourseSum,Avg>> [{x,y,z} | {{x,y},z} <-
                 (gc avg [{{k1,k3},x} | {k1,k2,k3,x} <- <<Details,Mark>>])];
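The grouping step used for the Total and Avg attributes can be sketched in Python as follows. This is our illustration, not AutoMed code: gc here is a hypothetical re-implementation of the grouping operator of Section 2.1, the tuple layout (Dept, CID, SID, Mark) for the extent of <<Details,Mark>> is an assumption, and the rows are invented. The grouping key is the department and course, following the description of CourseSum in the text.

```python
def gc(agg_fun, pairs):
    """Group (key, value) pairs on the key, then aggregate each group's values."""
    groups = {}
    for key, value in pairs:
        groups.setdefault(key, []).append(value)
    # dicts preserve insertion order, so groups appear in first-seen order
    return [(key, agg_fun(values)) for key, values in groups.items()]


def avg(xs):
    return sum(xs) / len(xs)


# Invented extent for <<Details,Mark>>, assumed layout (Dept, CID, SID, Mark).
details_mark = [
    ("IS", "ISC01", "ISS01", 76),
    ("IS", "ISC01", "ISS02", 78),
    ("IS", "ISC02", "ISS01", 86),
    ("IS", "ISC02", "ISS02", 85),
]

# Pair each mark with its (Dept, CID) grouping key.
pairs = [((dept, cid), mark) for dept, cid, _sid, mark in details_mark]

# Flatten {{x,y},z} into {x,y,z}, as in the addAtt queries.
course_total = [(d, c, t) for (d, c), t in gc(sum, pairs)]
course_avg = [(d, c, a) for (d, c), a in gc(avg, pairs)]
```

Because sum has an inverse, the Total extent could be maintained incrementally by property (d); the Avg extent could be handled similarly if the sum and count are kept alongside it.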



3 Expressing Schema and Data Model Evolution 

In a heterogeneous data warehousing environment, it is possible for either a 
data source schema or the integrated database schema to evolve. This schema 
evolution may be a change in the schema, or a change in the data model in 
which the schema is expressed, or both. AutoMed transformations can be used 
to express the schema evolution in all three cases: 

(a) Consider first a schema S expressed in a modelling language M. We can
express the evolution of S to S^new, also expressed in M, as a series of primitive
transformations that rename, add, extend, delete or contract constructs of M.
For example, suppose that the relational schema LS1 in the above example
evolves so its three tables become a single table with an extra column for
the course ID. This evolution is captured by a pathway which is identical to
the pathway LS1 → CS1 given above.

This kind of transformation that captures well-known equivalences between 
schemas can be defined in AutoMed by means of a parametrised transforma- 
tion template which is schema- and data-independent. When invoked with 
specific schema constructs and their extents, a template generates the appro- 
priate sequence of primitive transformations within the Schemas & Trans- 
formations Repository - see [5] for details. 

(b) Consider now a schema S expressed in a modelling language M which evolves
into an equivalent schema S^new expressed in a modelling language M^new. We
can express this translation by a series of add steps that define the constructs
of S^new in M^new in terms of the constructs of S in M. At this stage, we have
an intermediate schema that contains the constructs of both S and S^new.
We then specify a series of delete steps that remove the constructs of M (the
queries within these transformations indicate that these are now redundant
constructs since they can be derived from the new constructs).

For example, suppose that XML schema LS2 in the above example evolves
into an equivalent relational schema consisting of a single table with one
column per attribute of LS2. This evolution is captured by a pathway which is
identical to the pathway LS2 → CS2 given above.

Again, such generic inter-model translations between one data model and 
another can be defined in AutoMed by means of transformation templates. 

(c) Consider finally an evolution which is both a change in the schema
and in the data model. This can be expressed by a combination of (a) and
(b) above: either (a) followed by (b), or (b) followed by (a), or indeed by
interleaving the two processes.

4 Handling Schema Evolution 

In this section we consider how the general integration network illustrated in 
Figure 1 is evolvable in the face of evolution of a local schema or the warehouse 
schema. We have seen in the previous section how AutoMed transformations can 
be used to express the schema evolution if either the schema or the data model 
changes, or both. We can therefore treat schema and data model change in a 
uniform way for the purposes of handling schema evolution: both are expressed 
as a sequence of AutoMed primitive transformations, in the first case staying 
within the original data model, and in the second case transforming the original 
schema in the original data model into a new schema in a new data model. 

In this section we describe the actions that are taken in order to evolve the 
integration network of Figure 1 if the global schema GS evolves (Section 4.1) or 
if a local schema LSi evolves (Section 4.2). Given an evolution pathway from a
schema S to a schema S^new, in both cases each successive primitive
transformation within the pathway S → S^new is treated one at a time. Thus, we
describe in sections 4.1 and 4.2 the actions that are taken if S → S^new consists
of just one primitive transformation. If S → S^new is a composite transformation,
then it is handled as a sequence of primitive transformations. Our discussion below
assumes that the primitive transformation being handled is adding, removing
or renaming a construct of S that has an underlying data extent. We do not
discuss the addition or removal of constraints here as these do not impact on
the materialised data, and we make the assumption that any constraints in the
pathway S → S^new have been verified as being valid.

4.1 Evolution of the Global Schema 

Suppose the global schema GS evolves by means of a primitive transformation
t into GS^new. This is expressed by the step t being appended to the pathway
Tu of Figure 1. The new global schema is GS^new and its associated extension is
GD^new. GS is now an intermediate schema in the extended pathway Tu; t and
it no longer has an extension associated with it. t may be a rename, add, extend,
delete or contract transformation. The following actions are taken in each case:

1. If t is rename c c', then there is nothing further to do. GS is semantically
equivalent to GS^new and GD^new is identical to GD except that the extent
of c in GD is now the extent of c' in GD^new.

2. If t is add c q, then there is nothing further to do at the schema level. GS is
semantically equivalent to GS^new. However, the new construct c in GD^new
must now be populated, and this is achieved by evaluating the query q over
GD.

3. If t is extend c, then the new construct c in GD^new is populated by an empty
extent. This new construct may subsequently be populated by an expansion
in a data source (see Section 4.2).

4. If t is delete c q or contract c, then the extent of c must be removed from GD
in order to create GD^new (it is assumed that this is a legal deletion/contraction,
e.g. if we wanted to delete/contract a table from a relational schema, then
first the constraints and then the columns would be deleted/contracted and
lastly the table itself; such syntactic correctness of transformation pathways
is automatically verified by AutoMed). It may now be possible to simplify
the transformation network, in that if Tu contains a matching transformation
add c q or extend c, then both this and the new transformation t can be
removed from the pathway US → GS^new. This is purely an optimization: it
does not change the meaning of a pathway, nor its effect on view generation
and query/data translation. We refer the reader to [19] for details of the
algorithms that simplify AutoMed transformation pathways.
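The data-level effect of the four cases above can be sketched as follows. This is an illustration under our own assumptions: GD is modelled as a Python dict from construct names to extents, a primitive step is a tuple, and the query q of an add step is a Python function over GD. None of these representations is taken from the AutoMed toolkit, and pathway simplification is not modelled.

```python
def evolve_global(gd, step):
    """Return GD^new for one primitive step applied to GS; gd is left unchanged."""
    kind = step[0]
    new_gd = dict(gd)
    if kind == "rename":                      # case 1: the extent moves to c'
        _, old, new = step
        new_gd[new] = new_gd.pop(old)
    elif kind == "add":                       # case 2: evaluate q over GD
        _, c, query = step
        new_gd[c] = query(gd)
    elif kind == "extend":                    # case 3: empty extent
        _, c = step
        new_gd[c] = []
    elif kind in ("delete", "contract"):      # case 4: drop the extent of c
        c = step[1]
        del new_gd[c]
    else:
        raise ValueError(f"unknown primitive transformation: {kind}")
    return new_gd


# Invented global database with a single construct.
gd = {"CourseSum,Total": [("IS", "ISC01", 154)]}

renamed = evolve_global(gd, ("rename", "CourseSum,Total", "CourseSum,Sum"))
added = evolve_global(
    gd, ("add", "Courses", lambda db: [(d, c) for d, c, _ in db["CourseSum,Total"]])
)
extended = evolve_global(gd, ("extend", "CourseSum,Max"))
contracted = evolve_global(extended, ("contract", "CourseSum,Max"))
```

Returning a fresh dict rather than mutating gd mirrors the text's distinction between GD and GD^new.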

In cases 2 and 3 above, the new construct c will automatically be
propagated into the schema DMS of any data mart derived from GS. To prevent
this, a transformation contract c can be prefixed to the pathway GS → DMS.
Alternatively, the new construct c can be propagated to DMS if so desired, and
materialised there. In cases 1 and 4 above, the change in GS and GD may
impact on the data marts derived from GS, and we discuss this in Section 4.3.

4.2 Evolution of a Local Schema

Suppose a local schema LSi evolves by means of a primitive transformation t
into LSi^new. As discussed in Section 2, there is automatically available a reverse
transformation t̄ from LSi^new to LSi and hence a pathway t̄; Ti from LSi^new to
CSi. The new local schema is LSi^new and its associated extension is LDi^new.
LSi is now just an intermediate schema in the extended pathway t̄; Ti and it no
longer has an associated extension.

t may be a rename, add, delete, extend or contract transformation. In 1-5 below 
we see what further actions are taken in each case for evolving the integration 
network and the downstream materialised data as necessary. 

We first introduce some necessary terminology: If p is a pathway S → S' and
c is a construct in S, we denote by descendants(c, p) the constructs of S' which
are directly or indirectly dependent on c, either because c itself appears in S'
or because a construct c' of S' is created by a transformation add c' q within p
where the query q directly or indirectly references c. The set descendants(c, p)
can be straightforwardly computed by traversing p and inspecting the query
associated with each add transformation within it.
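A minimal Python sketch of this traversal is given below. It rests on assumptions of ours: a pathway is a list of tuples, each add step carries the set of constructs its query references (a real implementation would extract these from the IQL query), and the constructs of S' are passed in explicitly. The representation is hypothetical.

```python
def descendants(c, pathway, target_constructs):
    """Constructs of S' that depend, directly or indirectly, on construct c of S.

    pathway: list of steps; an add step is ("add", new_construct, referenced),
             where `referenced` is the set of constructs its query mentions.
    target_constructs: the constructs of the target schema S'.
    """
    derived = {c}
    for step in pathway:
        if step[0] == "add":
            _, new_c, referenced = step
            # new_c depends on c if its query mentions c or anything
            # already known to depend on c
            if derived & set(referenced):
                derived.add(new_c)
    # keep only constructs that actually survive into S'
    return derived & set(target_constructs)


# Invented pathway: B depends on c only indirectly, via A, which is later deleted.
pathway = [
    ("add", "A", {"c"}),
    ("add", "B", {"A"}),
    ("add", "D", {"other"}),
    ("delete", "A", {"B"}),
]
result = descendants("c", pathway, {"c", "B", "D"})
```

A single forward pass suffices because a query can only reference constructs that already exist at that point in the pathway.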

1. If t is rename c c', then schema LSi^new is semantically equivalent to LSi. The
new transformation pathway Ti^new : LSi^new → CSi is t̄; Ti = rename c' c; Ti.
The new local database LDi^new is identical to LDi except that the extent of
c in LDi is now the extent of c' in LDi^new.

2. If t is add c q, then LSi has evolved to contain a new construct c whose
extent is equivalent to the expression q over the other constructs of LSi.
The new transformation pathway Ti^new : LSi^new → CSi is t̄; Ti = delete c q; Ti.

3. If t is delete c q, this means that LSi has evolved to not include a construct
c whose extent is derivable from the expression q over the other constructs
of LSi, and the new local database LDi^new no longer contains an extent for
c. The new transformation pathway Ti^new : LSi^new → CSi is t̄; Ti = add c q; Ti.

In the above three cases, schema LSi^new is semantically equivalent to LSi,
and nothing further needs to be done to any of the transformation pathways,
schemas or databases CD1, ..., CDn and GD. This may not be the case if t is
a contract or extend transformation, which we consider next.

4. If t is extend c, then there will be a new construct available from LSi^new
that was not available before. That is, LSi has evolved to contain the new
construct c whose extent is not derivable from the other constructs of LSi.
If we left the transformation pathway Ti as it is, this would result in a
pathway Ti^new = contract c; Ti from LSi^new to CSi, which would immediately
drop the new construct c from the integration network. That is, Ti^new is
consistent but it does not utilize the new data.

However, recall that we said earlier that we assume no contract steps in the
pathways from local schemas to their union schemas, and that all the data in
LSi should be available to the integration network. In order to achieve this, there
are four cases to consider:

(a) c appears in USi and has the same semantics as the newly added c in LSi^new.
Since c cannot be derived from the original LSi, there must be a
transformation extend c in CSi → USi.

We remove from Ti^new the new contract c step and this matching extend c
step. This propagates c into CSi, and we populate its extent in the
materialised database CDi by replicating its extent from LDi^new.

(b) c does not appear in USi but it can be derived from USi by means of some
transformation T.

In this case, we remove from Ti^new the first contract c step, so that c is now
present in CSi and in USi. We populate the extent of c in CDi by replicating
its extent from LDi^new.

To repair the other pathways Tj : LSj → CSj and schemas USj for j ≠ i,
we append T to the end of each Tj. As a result, the new construct c now
appears in all the union schemas. To add the extent of this new construct to
each materialised database CDj for j ≠ i, we compute it from the extents
of the other constructs in CSj using the queries within successive add steps
in T.

We finally append the necessary new id steps between pairs of union schemas
to assert the semantic equivalence of the construct c within them.

(c) c does not appear in USi and cannot be derived from USi.

In this case, we again remove from Ti^new the first contract c step so that c is
now present in schema CSi.

To repair the other pathways Tj : LSj → CSj and schemas USj for j ≠ i, we
append an extend c step to the end of each Tj. As a result, the new construct
c now appears in all the conformed schemas CS1, ..., CSn.

The construct c may need further translation into the data model of the
union schemas and this is done by appending the necessary sequence, T, of
add/delete/rename steps to all the pathways LS1 → CS1, ..., LSn → CSn.
We compute the extent of c within the database CDi from its extent within
LDi^new using the queries within successive add steps in T.

We finally append the necessary new id steps between pairs of union schemas
to assert the semantic equivalence of the new construct(s) within them.

(d) c appears in USi but has different semantics to the newly added c in LSi^new.
In this case, we rename c in LSi^new to a new construct c'. The situation
reverts to adding a new construct c' to LSi^new, and one of (a)-(c) above
applies.

We note that determining whether c can or cannot be derived from the 
existing constructs of the union schemas in (a)-(d) above requires domain or 
expert human knowledge. Thereafter, the remaining actions are fully automatic. 

In cases (a) and (b), there is new data added to one or more of the
conformed databases which needs to be propagated to GD. This is done by
computing descendants(c, Tu) and using the algebraic equivalences of Section 2.1 to
propagate changes in the extent of c to each of its descendant constructs gc in
GS. Using these equivalences, we can in most cases incrementally recompute the
extent of gc. If at any stage in Tu there is a transformation add c' q where no
equivalence can be applied, then we have to recompute the whole extent of c'.

In cases (b) and (c), there is a new schema construct c appearing in the USi.
This construct will automatically appear in the schema GS. If this is not desired,
a transformation contract c can be prefixed to Tu.

5. If t is contract c, then the construct c in LSi will no longer be available
from LSi^new. That is, LSi has evolved so as to not include a construct c
whose extent is not derivable from the other constructs of LSi. The new
local database LDi^new no longer contains an extent for c.

The new transformation pathway Ti^new : LSi^new → CSi is t̄; Ti = extend c; Ti.
Since the extent of c is now Void, the materialised data in CDi and GD must
be modified so as to remove any data derived from the old extent of c.

In order to repair CDi, we compute descendants(c, LSi → CSi). For each
construct uc in descendants(c, LSi → CSi), we compute its new extent and
replace its old extent in CDi by the new extent. Again, the algebraic
properties of IQL queries discussed in Section 2.1 can be used to propagate the
new Void extent of construct c in LSi^new to each of its descendant constructs
uc in CSi. Using these equivalences, we can in most cases incrementally
recompute the extent of uc as we traverse the pathway Ti.

In order to repair GD, we similarly propagate changes in the extent of each
uc along the pathway Tu.

Finally, it may also be necessary to amend the transformation pathways
if there are one or more constructs in GD which now will always have an
empty extent as a result of this contraction of LSi. For any construct uc
in US whose extent has become empty, we examine all pathways T1, ...,
Tn. If all these pathways contain an extend uc transformation, or if using
the equivalences of Section 2.1 we can deduce from them that the extent
of uc will always be empty, then we can suffix a contract gc step to Tu for
every gc in descendants(uc, Tu), and then handle this case as paragraph 4
in Section 4.1.
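The reverse transformations used in cases 1-5 follow a simple pattern, which the Python sketch below makes explicit. The tuple representation of steps and the function names are our own assumptions. The sketch builds only the initial new pathway t̄; Ti, and does not perform the further repairs described for the extend and contract cases.

```python
def reverse(step):
    """Reverse of one primitive transformation, following cases 1-5 above."""
    kind = step[0]
    if kind == "rename":                 # rename c c'  ->  rename c' c
        _, c, c_new = step
        return ("rename", c_new, c)
    if kind == "add":                    # add c q      ->  delete c q
        return ("delete", step[1], step[2])
    if kind == "delete":                 # delete c q   ->  add c q
        return ("add", step[1], step[2])
    if kind == "extend":                 # extend c     ->  contract c
        return ("contract", step[1])
    if kind == "contract":               # contract c   ->  extend c
        return ("extend", step[1])
    raise ValueError(f"unknown primitive transformation: {kind}")


def new_pathway(t, t_i):
    """Pathway from LSi^new to CSi: the reverse of t prefixed to Ti."""
    return [reverse(t)] + list(t_i)


# Invented example: LSi evolves by contracting construct "MAC03,Mark";
# the existing pathway Ti is abbreviated to a single placeholder step.
t_i = [("add", "MAtab", "q1")]
t_i_new = new_pathway(("contract", "MAC03,Mark"), t_i)
```

Reversal is an involution in this sketch: reversing a step twice yields the original step.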



4.3 Evolution of Downstream Data Marts 

We have discussed how evolutions to the global schema or to a source schema 
are handled. One remaining question is how to handle the impact of a change to 
the data warehouse schema, and possibly its data, on any data marts that have 
been derived from it. 

In [7] we discuss how it is possible to express the derivation of a data mart
from a data warehouse by means of an AutoMed transformation pathway. Such
a pathway GS → DMS expresses the relationship of a data mart schema DMS
to the warehouse schema GS. As such, this scenario can be regarded as a special
case of the general integration scenario of Figure 1, where GS now plays the role
of the single source schema, databases CD1, ..., CDn and GD collectively play
the role of the data associated with this source schema and DMS plays the role
of the global schema. Therefore, the same techniques as discussed in sections 4.1
and 4.2 can be applied.

5 Concluding Remarks

In this paper we have described how the AutoMed heterogeneous data integra- 
tion toolkit can be used to handle the problem of schema evolution in hetero- 
geneous data warehousing environments so that the previous transformation, 
integration and data materialisation effort can be reused. Our algorithms are 
mainly automatic, except for the aspects that require domain or expert human 
knowledge regarding the semantics of new schema constructs. 

We have shown how AutoMed transformations can be used to express schema 
evolution within the same data model, or a change in the data model, or both, 
whereas other schema evolution literature has focussed on just one data model. 
Schema evolution within the relational data model has been discussed in pre- 
vious work such as [11,12,18]. The approach in [18] uses a first-order schema 
in which all values in a schema of interest to a user are modelled as data, and 
other schemas can be expressed as a query over this first-order schema. The 
approach in [12] uses the notation of a flat scheme, and gives four operators
Unite, Fold, Unfold and Split to perform relational schema evolution
using the SchemaSQL language. In contrast, with AutoMed the process of schema
evolution is expressed using a simple set of primitive schema transformations 
augmented with a functional query language, both of which are applicable to 
multiple data models. 

Our approach is complementary to work on mapping composition, e.g. [20, 
14], in that in our case the new mappings are a composition of the original 
transformation pathway and the transformation pathway which expresses the 
schema evolution. Thus, the new mappings are, by definition, correct. There are 
two aspects to our approach: (i) handling the transformation pathways and (ii) 
handling the queries within them. In this paper we have in particular assumed 
that the queries are expressed in IQL. However, the AutoMed toolkit allows any 
query language syntax to be used within primitive transformations, and therefore 
this aspect of our approach could be extended to other query languages. 

Materialised data warehouse views need to be maintained when the data 
sources change, and much previous work has addressed this problem at the data 
level. However, as we have discussed in this paper, materialised data warehouse 
views may also need to be modified if there is an evolution of a data source 
schema. Incremental maintenance of schema-restructuring views within the re- 
lational data model is discussed in [10], whereas our approach can handle this 
problem in a heterogeneous data warehousing environment with multiple data 
models and changes in data models. Our previous work [7] has discussed how 
AutoMed transformation pathways can also be used for incrementally maintain- 
ing materialised views at the data level. For future work, we are implementing 
our approach and evaluating it in the context of biological data warehousing.



Abstract. For systems that share enough structural and functional
commonalities, reuse in schema development and data manipulation can be achieved by
defining problem-oriented languages. Such languages are often called domain- 
specific, because they introduce powerful abstractions meaningful only within 
the domain of observed systems. In order to use domain-specific languages for 
database applications, a mapping to SQL is required. In this paper, we deal with 
metaprogramming concepts required for easy definition of such mappings. Us- 
ing an example domain-specific language, we provide an evaluation of mapping 
performance. 



1 Introduction 

A large variety of approaches use SQL as a language for interacting with the database, 
but at the same time provide a separate problem-oriented language for developing 
database schemas and formulating queries. A translator maps a statement in such 
problem-oriented language to a series of SQL statements that get executed by the 
DBMS. An example of such a system is Preference SQL, described by Kießling and
Köstler [8]. Preference SQL is an SQL extension that provides a set of language constructs
which support easy use of soft preferences. Such preferences are useful
when searching for products and services in diverse e-commerce applications where a 
set of strictly observed hard constraints usually results in an empty result set, although 
products that approximately match the user’s demands do exist. The supported con- 
structs include approximation (clauses AROUND and BETWEEN), minimization/maxi- 
mization (clauses LOWEST, HIGHEST), favorites and dislikes (clauses POS, NEG), 
Pareto accumulation (clause AND), and cascading of preferences (clause CASCADE)
(see [8] for examples). 

In general, problem-oriented programming languages are also called domain-spe- 
cific languages (DSLs), because they prove useful when developing and using systems 
from a predefined domain. The systems in a domain will exhibit a range of similar 
structural and functional features (see [4,5] for details), making it possible to describe 
them (and, in our case, query their data) using higher-level programming constructs. In 
turn, these constructs carry semantics meaningful only within this domain. As the 
activity of using these constructs is referred to as programming, defining such constructs
and their mappings to languages that can be compiled or interpreted to allow
their execution is referred to as metaprogramming. 

This paper focuses on the application of metaprogramming for relational databases. 
In particular, we are interested in concepts that guide the implementation of fast map- 
pings of custom languages, used for developing database schemas and manipulating 
data, onto SQL-DDL and SQL-DML. The paper is structured as follows. First, in 
Sect. 2, we further motivate the need for DSLs for data management. An overview of 
related work is given by Sect. 3. Our system prototype (DSL-DA - domain-specific 
languages for database applications) that supports the presented ideas is outlined in 
Sect. 4. A detailed performance evaluation of a DSL for the example product line will 
be presented in Sect. 5. Sect. 6 gives a detailed overview of metaprogramming con- 
cepts. Finally, in Sect. 7, we summarize our results and give some ideas for the future 
work related to our approach. 



2 Domain-Specific Languages

The idea of DSLs is tightly related to domain engineering. According to Czarnecki and 
Eisenecker [5], domain engineering deals with collecting, organizing, and storing past 
experience in building systems in the form of reusable assets. In general, we can expect
a given asset to be reusable in a new system if this system possesses some structural
and functional similarity to previous systems. Indeed, systems that share enough
common properties are said to constitute a system family (a more market-oriented term 
for a system family is a software product line). Examples of software product lines are
extensively outlined by Clements and Northrop [4] and include satellite controllers, in- 
ternal combustion engine controllers, and systems for displaying and tracing stock-mar- 
ket data. Further examples of more data-centric product lines include CRM and ERP 
systems. Our example product line for versioning systems will be introduced in Sect. 4. 

Three approaches can be applied to allow the reuse of “assets” when developing da- 
tabase schemas for systems in a data-intensive product line. 

Components: Schema components can be used to group larger reusable parts of a database
schema to be used in diverse systems afterwards (see Thalheim [16] for an extensive
overview of this approach). Generally, the modularity of system specification
(which components are to be used) directly corresponds to the modularity of the resulting
implementation, because a component does not influence the internal implementation
of other components. This kind of transformation from specification to implementation
is referred to as a vertical transformation or forward refinement [5].

Frameworks: Much like software frameworks in general (see, for example, Apache 
Struts [1] or IBM San Francisco [2]), schema frameworks rely on the user to extend 
them with system-specific parts. This step is called framework instantiation and re- 
quires certain knowledge of how the missing parts will be called by the framework. 
Most often, this is achieved by extending superclasses defined by the framework or im- 
plementing call-back methods which will be invoked by mechanisms such as reflection. 
In a DBMS, application logic otherwise captured by such methods can be defined by 
means of constraints, trigger conditions and actions, and stored procedures. A detailed 
overview of schema frameworks is given by Mahnke [9]. Being more flexible than
components, frameworks generally require more expertise from the user. Moreover,
for performance reasons, most DBMSs refrain from offering dynamic invocation through
method overloading or reflection (otherwise supported in common OO programming
languages). For this reason, schema frameworks are difficult to implement
without middleware acting as a mediator for such calls. 

Generators: Schema generators are, in our opinion, the most advanced approach to
reuse and are the central topic of this paper. A schema generator acts much like a compiler:
It transforms a high-level specification of the system to a schema definition, possibly
equipped with constraints, triggers, and stored procedures. In general, the modularity
of the specification does not have to be preserved. Two modular parts of the specification
can be interwoven to obtain a single modular part in the schema (these
transformations are called horizontal transformations; in case the obtained part in the
schema is also refined, for example, columns not explicitly defined in the specification
are added to a table, this is called an oblique transformation, i.e., a combination of a horizontal
and a vertical transformation).

It is important to note that there is no special “magic” associated with schema gen- 
erators that allows them to obtain a ready-to-use schema out of a short specification. By 
narrowing the domain of systems, it is possible to introduce very powerful language ab- 
stractions that are used at the specification level. Due to similarities between systems, 
these abstractions aggregate a lot of semantics that is dispersed across many schema el- 
ements. Because defining this semantics in SQL-DDL proves labour-intensive, we rath- 
er choose to define a special domain-specific DDL (DS-DDL) for specifying the sche- 
ma at a higher level of abstraction and implement the corresponding mapping to SQL- 
DDL. The mapping represents the “reusable asset” and can be used with any schema 
definition in this DS-DDL. The data manipulation part complementary to DS-DDL is 
called DS-DML and allows the use of domain-specific query and update statements in 
application programs. Defining custom DS-DDLs and their mappings to SQL-DDL as 
well as fast translation of DS-DML statements is the topic we explore in this paper. 
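
To make the idea of a reusable DS-DDL to SQL-DDL mapping more concrete, the following minimal sketch shows a toy generator for object types. It is not the DSL-DA implementation: the class `ObjectType`, the function `reduce_object_type`, the SQL types, and the treatment of unversioned types are assumptions made for illustration. Only the table suffix `OT` and the column names `globalId`, `objectId`, `versionId`, and `isFrozen` follow the names that appear later in the paper.

```python
from dataclasses import dataclass


@dataclass
class ObjectType:
    """One object type in a hypothetical DS-DDL specification."""
    name: str
    attributes: dict[str, str]   # attribute name -> SQL type
    versioned: bool = True


def reduce_object_type(ot: ObjectType) -> str:
    """Map one object type to a CREATE TABLE statement.

    The identifier columns are never mentioned in the specification; adding
    them here is an instance of what the text calls an oblique transformation.
    """
    columns = ["globalId INTEGER PRIMARY KEY", "objectId INTEGER NOT NULL"]
    if ot.versioned:
        # Assumption: only versioned types carry version-related columns.
        columns += ["versionId INTEGER NOT NULL", "isFrozen INTEGER DEFAULT 0"]
    columns += [f"{name} {sql_type}" for name, sql_type in ot.attributes.items()]
    return f"CREATE TABLE {ot.name}OT (\n  " + ",\n  ".join(columns) + "\n);"


cost = ObjectType("Cost", {"wages": "REAL", "travelExpenses": "REAL"})
print(reduce_object_type(cost))
```

The point of the sketch is that the mapping function, not the generated schema, is the reusable asset: it can be applied unchanged to any specification written against the same (here hypothetical) DS-DDL.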



3 Related Work 

Generators are the central idea of the OMG’s Model Driven Architecture (MDA) [13] 
which proposes the specification of systems using standardized modeling languages 
(UML) and automatic generation of implementations from models. However, even
the OMG recognizes the need to support custom domain-specific modeling languages. As
noted by Frankel [6], this can be done in three different ways: 

• Completely new modeling languages: A new DSL can be obtained by defining a 
new MOF-based metamodel. 

• Heavyweight language extensions: A new DSL can be obtained by extending the 
elements of a standardized metamodel (e.g., the UML Metamodel). 

• Lightweight language extensions: A new DSL can be obtained by defining new
language abstractions using the language itself. In UML, this possibility is supported
by UML Profiles.

The research area that deals with developing custom (domain-specific) software engineering
methodologies well suited for particular systems is called computer-aided
method engineering (CAME) [14]. CAME tools allow users to describe their own modeling
method and afterwards generate a CASE tool that supports this method. For an example
of a tool supporting this approach, see MetaEdit+ [11].

The idea of a rapid definition of domain-specific programming languages and their 
mapping to a platform where they can be executed is materialized in Simonyi’s work 
on Intentional Programming (IP) [5,15]. IP introduces an IDE based on active libraries 
that are used to import language abstractions (also called intentions) into this environ- 
ment. Programs in the environment are represented as source graphs in which each node 
possesses a special pointer to a corresponding abstraction. The abstractions define ex- 
tension methods which are metaprograms that specify the behavior of nodes. The fol- 
lowing are the most important extension methods in IP. 

• Rendering and type-in methods. Because it is cumbersome to edit the source graph 
directly, rendering methods are used to visualize the source graph in an editable no- 
tation. Type-in methods convert the code typed in this notation back to the source 
graph. This is especially convenient when different notations prove useful for a sin- 
gle source graph. 

• Refactoring methods. These methods are used to restructure the source graph by 
factoring out repeating code parts to improve reuse. 

• Reduction methods. The most important component of IP, these methods reduce
the source graph to a graph of low-level abstractions (also called reduced code or
R-code) that represent programs executable on a given platform. Different reduction
methods can be used to obtain the R-code for different platforms.

How does this work relate to our problem? As in IP, we want to support the custom
definition of abstractions that form both a custom DS-DDL and a custom DS-DML.
We want to support the rendering of source graphs for DS-DDL and DS-DML state- 
ments to (possibly diverse) domain-specific textual representations. Most importantly, 
we want to support the reduction of these graphs to graphs representing SQL statements 
that can be executed by a particular DBMS. 



4 DSL-DA System 

In our DSL-DA system, the user starts by defining a domain-specific (DS) metamodel
that describes language abstractions that can appear in the source graph (the language 
used for defining metamodels is a simplified variant of the MOF Model) for the DS- 
DDL. We used the system to fully implement a DSL for the example product line of ver- 
sioning systems which we also use in the next section for the evaluation of our approach. 
In this product line, each system is used to store and version objects (of some object 
type) and relationships (of some relationship type). Thus individual systems differ in 
their type definitions (also called information models [3]) as well as other features il- 
lustrated in the DS-DDL metamodel in Fig. 1 and explained below. 

Fig. 1. DS-DDL metamodel for the example product line



• Object types can be versioned or unversioned. The number of direct successors to 
a version can be limited to some number (maxSuccessors) for a given versioned object
type.

• Relationship types connect to object types using either non-floating or floating re- 
lationship ends. A non-floating relationship end connects directly to a particular 
version as if this version were a regular object. On the other hand, a floating rela- 
tionship end maintains a user-managed subset of all object versions for each con- 
nected object. Such subsets are called candidate version collections (CVC) and 
prove useful for managing configurations. In unfiltered navigation from some ori- 
gin object, all versions contained in every connected CVC will be returned. In fil- 
tered navigation, a version preselected for each CVC (also called the pinned ver- 
sion) will be returned. In case there is no pinned version, we return the latest ver- 
sion from the CVC. 

• Workspace objects act as containers for other objects. However, only one version 
of a contained object can be present in the workspace at a time. In this way, work- 
spaces allow a version-free view to the contents of a versioning system. When ex- 
ecuted within a workspace, filtered navigation returns versions from the CVC that 
are connected to this workspace and ignores the pin setting of the CVC. 

• Operations create object, copy, delete, create successor, attach/detach (connects/ 
disconnects an object to/from a workspace), freeze, and checkout/checkin (locks/ 
unlocks the object) can propagate across relationships. 
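
The navigation rules for floating relationship ends described above can be summarized in a small sketch. The class `CVC` and its method names are hypothetical and exist only for illustration. The sketch also assumes that versions are kept in creation order, so that the last entry is the latest version, and that a workspace is given as the set of versions attached to it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CVC:
    """Candidate version collection for one connected object (illustrative)."""
    versions: list[int] = field(default_factory=list)  # globalIds, oldest first
    pinned: Optional[int] = None                        # globalId of the pinned version

    def unfiltered(self) -> list[int]:
        """Unfiltered navigation returns every version in the collection."""
        return list(self.versions)

    def filtered(self, workspace_versions: Optional[set[int]] = None) -> list[int]:
        """Filtered navigation, optionally executed within a workspace."""
        if workspace_versions is not None:
            # Inside a workspace the pin setting is ignored.
            return [v for v in self.versions if v in workspace_versions]
        if self.pinned is not None:
            return [self.pinned]
        return self.versions[-1:]   # latest version, or [] for an empty CVC


cvc = CVC(versions=[11, 12, 13])
print(cvc.filtered())        # no pin set: falls back to the latest version
cvc.pinned = 12
print(cvc.filtered())        # pin set: the pinned version wins
print(cvc.filtered({11}))    # inside a workspace: the attached version
```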

A model expressed using the DS-DDL metamodel from Fig. 1 will represent a source 
graph for a particular DS-DDL schema definition used to describe a given versioning 
system. To work with these models (manipulate the graph nodes), DSL-DA uses the 
DS-DDL metamodel to generate a schema editor that displays the graphs in a tree-like 
form (see the left-hand side of Fig. 2). A more convenient graphical notation of a source 
graph for our example versioning system that we will use for the evaluation in the next 
section is illustrated in Fig. 3. 

The metamodel classes define rendering and type-in methods that render the source
graph to a textual representation and allow its editing (right-hand side of Fig. 2). More
importantly, the metamodel classes define reduction methods that reduce the
source graph to its representation in SQL-DDL. In analogy with the domain-specific
level of the editor, the obtained SQL-DDL schema is also represented as a source graph;
the classes used for this graph are the classes defined by the package Relational of the
OMG’s Common Warehouse Metamodel (CWM) [12]. The rendering methods of these
classes are customizable so that by rendering the SQL-DDL source graphs, SQL-DDL
schemas in SQL dialects of diverse DBMS vendors can be obtained.

Fig. 2. DS-DDL schema development with the generated editor

Fig. 3. Example DS-DDL schema used in performance evaluation

Table 1. Examples of DS-DML statements

Statement:
    SELECT Task.*
    FROM Department-consistsOf->Employee-executes->Task
    WHERE Department.globalId = 502341
Explanation:
    Get all tasks executed by employees of a given department
    (all three objects are versioned). Note that the fact that the
    relationship end executes is floating (i.e., filtered navigation
    will be used) is transparent for the user.

Statement:
    CREATE SUCCESSOR OF OBJECT Task
    USE WORKSPACE Project
    WHERE globalId = 235711
    AND Task WHERE objectId = 982
Explanation:
    Create a successor version to a version of a task. The version
    graph for the task is identified by the objectId. The successor
    is to be created to the version attached to the workspace with
    a given globalId. Note that according to the DS-DDL schema,
    the operation will propagate to connected costs.

Statement:
    GET ALTERNATIVES OF Employee
    WHERE globalId = 234229
Explanation:
    Get the alternative versions (versions that have the same
    predecessor) of a given employee version.

Once an SQL-DDL schema is installed in a database, how do we handle statements 
in DS-DML (three examples of such statements are given by Table 1)? As for the DS- 
DDL, there is a complementary DS-DML metamodel that describes language abstrac- 
tions of the supported DS-DML statements. This metamodel can be simply defined by 
first coming up with an EBNF for DS-DML and afterwards translating the EBNF sym- 
bols to class definitions in a straightforward fashion. The EBNF of our DS-DML for the 
sample product line for versioning systems is available through [17]. DS-DML state- 
ments can then be represented as source graphs, where each node in the graph is an in- 
stance of some class from the DS-DML metamodel. Again, metamodel classes define 
reduction methods that reduce the corresponding DS-DML source graph to an SQL- 
DML source graph, out of which SQL-DML statements can be obtained through ren- 
dering. 
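
As a rough illustration of how EBNF symbols might translate to metamodel classes, the sketch below models the select statement from Table 1 as a tiny source graph and renders it back to text. The class names `NavigationStep` and `SelectStatement` and the method `render` are invented for this example; the actual DS-DML metamodel and its EBNF are available through [17] and are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class NavigationStep:
    """One step 'origin-role->target' of a navigation path."""
    origin: str
    role: str
    target: str


@dataclass
class SelectStatement:
    """Root node of a (much simplified) DS-DML select source graph."""
    projection: str
    steps: list[NavigationStep]
    condition: str

    def render(self) -> str:
        """Rendering method: turn the source graph back into DS-DML text."""
        path = self.steps[0].origin
        for step in self.steps:
            path += f"-{step.role}->{step.target}"
        return f"SELECT {self.projection}.*\nFROM {path}\nWHERE {self.condition}"


stmt = SelectStatement(
    projection="Task",
    steps=[
        NavigationStep("Department", "consistsOf", "Employee"),
        NavigationStep("Employee", "executes", "Task"),
    ],
    condition="Department.globalId = 502341",
)
print(stmt.render())
```

A reduction method would walk the same nodes but build an SQL-DML source graph instead of text; rendering that second graph then yields the SQL statements.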

DS-DML is used by an application programmer to embed domain-specific queries 
and data manipulation statements in the application code. In certain cases, the general 
structure of a DS-DML statement will be known at the time the application is written 
and the parameters of the statement will only need to be filled with user-provided values 
at run time. Since these parameters do not influence the reduction, the reduction from 
DS-DML to SQL-DML can take place using a precompiler. Sometimes, however, es- 
pecially in the case of Web applications, the structure of the DS-DML query will de- 
pend on the user’s search criteria and other preferences and is thus not known at compile 
time. The solution in this case is to wrap the native DBMS driver into a domain-specific 
driver that performs the reduction at run time, passes the SQL-DML statements to the 
native driver, and restructures the result sets before returning them to the user, if neces- 
sary. To handle both cases where query structure is known at compile time and when it 
is not, DSL-DA can generate both the precompiler and the domain-specific driver from 
the DS-DML metamodel, its reduction methods, and its rendering methods for SQL-DML.
For our evaluation in the next section, we assumed the worst-case scenario in which
all DS-DML statements need to be reduced at run time, which lets us examine the effect
of run-time reduction in detail.
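
The wrapping of a native driver by a domain-specific driver can be sketched as follows. This is only an illustration under several assumptions: Python's `sqlite3` connection stands in for the native DBMS driver, the statement form `GET ALL <Type>` is a made-up toy DS-DML that does not belong to the paper's language, and `toy_reduce` stands in for the generated parsing, reduction, and rendering steps.

```python
import sqlite3
import time
from typing import Callable


class DomainSpecificDriver:
    """Reduces DS-DML at run time and delegates to a native connection."""

    def __init__(self, native: sqlite3.Connection,
                 reduce: Callable[[str], list[str]]) -> None:
        self.native = native
        self.reduce = reduce
        self.t_dr = 0.0    # accumulated parsing + reduction + rendering time
        self.t_sql = 0.0   # accumulated time spent in the native driver

    def execute(self, ds_dml: str) -> list[tuple]:
        start = time.perf_counter()
        sql_statements = self.reduce(ds_dml)
        reduced = time.perf_counter()
        rows: list[tuple] = []
        for sql in sql_statements:
            rows = self.native.execute(sql).fetchall()  # keep the last result set
        self.t_dr += reduced - start
        self.t_sql += time.perf_counter() - reduced
        return rows


def toy_reduce(ds_dml: str) -> list[str]:
    """Hypothetical reduction that only understands 'GET ALL <Type>'."""
    prefix = "GET ALL "
    if not ds_dml.startswith(prefix):
        raise ValueError(f"unsupported statement: {ds_dml}")
    return [f"SELECT * FROM {ds_dml[len(prefix):].strip()}OT"]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TaskOT (globalId INTEGER, objectId INTEGER, versionId INTEGER)")
conn.execute("INSERT INTO TaskOT VALUES (1001, 1, 1)")
driver = DomainSpecificDriver(conn, toy_reduce)
print(driver.execute("GET ALL Task"))
```

A precompiler would call the same reduction once at build time and embed the resulting SQL, so the driver variant only pays the reduction cost when the statement structure is unknown until run time.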



5 Evaluation of the Example Product Line

The purpose of the evaluation presented in this section is to demonstrate the following. 

• Even for structurally complex DS-DML statements, the reduction process carried 
out at run time represents a very small proportion of costs needed to carry out the 
SQL-DML statements obtained by reduction. 

• DS-DDL schemas that have been reduced to SQL-DDL with certain optimizations 
in mind imply reduction that is more difficult to implement. Somewhat surprising- 
ly, this does not necessarily mean that such reduction will also take more process- 
ing time. Optimization considerations can significantly contribute to a faster exe- 
cution of DS-DML statements once reduced to SQL-DML. 

To demonstrate both points, we implemented four very different variants of both DS-DDL
and DS-DML reduction methods for the example product line. The DS-DDL
schema from Fig. 3 has thus been reduced to four different SQL-DDL schemas. In all
four variants, object types from Fig. 3 are mapped to tables (called object tables) with
the specified attributes. An object version is then represented as a tuple in this table. The
identifiers in each object table include an objectId (all versions of a particular object,
i.e., all versions within the same version tree, possess the same objectId), a versionId
(identifies a particular version within the version tree) and a globalId, which is a combination
of an objectId and a versionId. The four reductions differ in the following way.

• Variant 1: Store all relationships, regardless of relationship type, using a single
“generic” table. For a particular relationship, store the origin globalId, objectId,
versionId and the target rolename, globalId, objectId, and versionId as columns.
Use an additional column as a flag denoting whether the target version is pinned.

• Variant 2: Use separate tables for every relationship type. In case a relationship
type defines no floating ends or two floating ends, this relationship type can be represented
by a single table. In case only one relationship end is floating, such a relationship
type requires two tables, one for each direction of navigation.

• Variant 3: Improve Variant 2 by considering maximal multiplicity of 1 on non-floating
ends. For such ends, the globalId of the connected target object is stored
as a column in the object table of the origin object.

• Variant 4: Improve Variant 3 by considering maximal multiplicity of 1 on floating
ends. For such ends, the globalIds of the pinned version and the latest version of
the CVC for the target object can be stored as columns in the object table of the
origin object.
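
The difference between the first two variants can be sketched in a few lines. The table name `Relationship`, its column names, and the name of the second (reverse-direction) table in Variant 2 are assumptions made for illustration; only the naming pattern with the suffixes `F` and `NF` follows the table names used in Sect. 6.

```python
# Variant 1: one generic table for all relationships (column names are assumed).
GENERIC_TABLE_DDL = """
CREATE TABLE Relationship (
    originGlobalId  INTEGER, originObjectId  INTEGER, originVersionId INTEGER,
    targetRolename  TEXT,
    targetGlobalId  INTEGER, targetObjectId  INTEGER, targetVersionId INTEGER,
    isPinned        INTEGER DEFAULT 0
)"""


def variant2_tables(end_a: str, floating_a: bool,
                    end_b: str, floating_b: bool) -> list[str]:
    """Names of the tables Variant 2 needs for one relationship type."""
    def tag(floating: bool) -> str:
        return "F" if floating else "NF"

    forward = f"{end_a}{tag(floating_a)}_{end_b}{tag(floating_b)}"
    if floating_a == floating_b:
        return [forward]               # no or two floating ends: one table
    backward = f"{end_b}{tag(floating_b)}_{end_a}{tag(floating_a)}"
    return [forward, backward]         # one floating end: one table per direction


print(variant2_tables("causedBy", True, "ratedCosts", True))
print(variant2_tables("isPartOf", True, "contains", False))
```

Variants 3 and 4 would remove some of these tables again by moving the target identifiers into columns of the origin object table whenever the maximal multiplicity is 1.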

Our benchmark, consisting of 141,775 DS-DML statements, was then run using four
different domain-specific drivers corresponding to the four variants of reduction.
To eliminate the need to fetch metadata from the database, we assumed that, once
defined, the DS-DDL schema does not change, so each driver accessed the DS-DDL
schema defined in Fig. 3 directly in main memory. The overall time for executing a
DS-DML statement is defined as $t_{DS} = t_{par} + t_{red} + t_{ren} + t_{SQL}$, where $t_{par}$ is the required
DS-DML parsing time, $t_{red}$ the time required for reduction, $t_{ren}$ the time required for
rendering all resulting SQL-DML statements, and $t_{SQL}$ the time used to carry out these
statements. Note that $t_{par}$ is independent of the variant, so we were mainly interested in
the remaining three times as well as the overall time. The average $t_{DS}$, $t_{red}$, $t_{ren}$ and $t_{SQL}$
values (in µs) for the category of select statements are illustrated in Fig. 4. This category
included queries over versioned data within and outside workspaces that contained up
to four navigation steps. As evident from Fig. 4, Variant 4 demonstrates a very good
$t_{SQL}$ performance and also allows the fastest reduction. On the other hand, due to materialization
of the globalIds of pinned and latest versions for CVCs in Variant 4,
Variant 2 proves faster for manipulation (i.e., creation and deletion of relationships).
The values for the category of create relationship statements are illustrated in Fig. 5.

Fig. 6. Overhead due to DS-DML parsing, reduction and rendering

Most importantly, the overhead time required due to the domain-specific driver,
$t_{dr} = t_{par} + t_{red} + t_{ren}$, proves to be only a small portion of $t_{DS}$. As illustrated in Fig. 6, when
using Variant 4, the portion $t_{dr}/t_{DS}$ is lowest (0.8%) for the category of select statements
and highest (9.9%) for merge statements. When merging two versions (denoted as primary
and secondary version), their attribute values have to be compared to their so-called
base (latest common) version in the version graph to decide which values should
be used for the result of the merging. This comparison, which is performed in the driver,
accounts for a high $t_{red}$ value (9.1% of $t_{DS}$). Note that $t_{SQL}$ is the minimal time an application
spends executing SQL-DML statements in any case (with or without DS-DML
available) to provide the user with equivalent results: Even without DS-DML, the
programmer would have to implement data flows to connect sequences of SQL-DML
statements to perform a given operation (in our evaluation, we treat data flows as part
of $t_{red}$).

Fig. 7. Properties of reduction methods
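
The timing decomposition used in this section amounts to simple arithmetic, which the following sketch spells out. The function name `driver_overhead` is hypothetical, and the numbers in the usage example are made up for illustration; they are not measurements from the evaluation.

```python
def driver_overhead(t_par: float, t_red: float, t_ren: float, t_sql: float) -> float:
    """Return the portion t_dr / t_DS for one statement (same unit for all times)."""
    t_dr = t_par + t_red + t_ren   # overhead added by the domain-specific driver
    t_ds = t_dr + t_sql            # overall time for the DS-DML statement
    return t_dr / t_ds


# Made-up values for illustration only.
share = driver_overhead(t_par=10.0, t_red=30.0, t_ren=10.0, t_sql=950.0)
print(f"{share:.1%}")
```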

How difficult is it to implement the DS-DML reduction methods? To estimate this
aspect, we used measures such as the count of expressions, statements, conditional
statements, loops, as well as McCabe’s cyclomatic complexity [10] and Halstead
effort [7] on our Java implementation of reduction methods. The summarized results
obtained using these measures are illustrated by Fig. 7. All measures, except for the
count of loops, confirm an increasing difficulty to implement the reduction (e.g., the
Halstead effort almost doubles from Variant 1 to Variant 4). Is there a correlation between
the Halstead effort for writing a method and the times $t_{red}$ and $t_{SQL}$? We try to
answer this question in Fig. 8. Somewhat surprisingly, a statement with a reduction
more difficult to implement will sometimes also reduce faster (i.e., an increase in Halstead
effort does not necessarily imply an increase in $t_{red}$), which is most evident for the
category of select statements. The explanation is that even though the developer has to
consider a large variety of different reductions for a complex variant (e.g., Variant 4),
once the driver has found the right reduction (see Sect. 6), the reduction can proceed
even faster than for a variant with fewer optimization considerations (e.g., Variant 1). For
all categories in Fig. 8, a decreasing trend for $t_{SQL}$ values can be observed. However, in
categories that manipulate the state of the CVC (note that operations from the category
copy object propagate across relationships and thus manipulate the CVCs), impedance
due to materializing the pin setting and the latest version comes into effect and often
results in only minor differences in $t_{SQL}$ values among Variants 2-4.



6 Metaprogramming Concepts 

Writing metacode is different and more difficult than writing code, because the pro- 
grammer has to consider a large variety of cases that may occur depending on the form 
of the statement and the properties defined in the DS-DDL schema. 

Our key idea for developing reduction methods is the so-called reduction polymorphism.
In OO programming languages, polymorphism supports dynamic selection of
the “right” method depending on the type of object held by a reference (since the type
is not known until run time, this is usually called late binding). In this way, it is possible
to avoid distracting conditional statements (explicit type checking by the programmer)
in the code. In a similar way, we use reduction polymorphism to avoid explicit use of
conditional statements in metacode. This means that for an incoming DS-DML statement,
the domain-specific driver will execute reduction methods that (a) match the syntactic
structure of the statement and (b) apply to the specifics of the DS-DDL schema
constructs used in the statement. We illustrate both concepts using a practical example.

Consider the following DS-DML statement.

1: SELECT Cost.*
2: FROM Offer-contains->Task-ratedCosts->Cost
3: USE WORKSPACE Project WHERE globalId = 435532 AND Offer WHERE objectId = 122;

Using our DS-DDL schema from Fig. 3 and reduction Variant 4, the statement gets
reduced to the following SQL-DML statement (OT denotes object table, ATT the attachment
relationship table, F a floating end, and NF a non-floating end).

1: SELECT CostOT.globalId, CostOT.objectId, CostOT.versionId, CostOT.wages, CostOT.travelExpenses,
2: CostOT.materialExpenses, CostOT.validUntil, CostOT.isFrozen
3: FROM OfferOT, TaskOT, CostOT, isPartOfF_containsNF, causedByF_ratedCostsF,
4: ProjectOT, Project_OfferATT, Project_TaskATT, Project_CostATT
5: WHERE OfferOT.globalId = isPartOfF_containsNF.isPartOfGlobalId
6: AND isPartOfF_containsNF.containsGlobalId = TaskOT.globalId
7: AND TaskOT.globalId = causedByF_ratedCostsF.causedByGlobalId
8: AND causedByF_ratedCostsF.ratedCostsGlobalId = CostOT.globalId
9: AND Project_OfferATT.projectGlobalId = 435532
10: AND Project_OfferATT.offerGlobalId = OfferOT.globalId
11: AND Project_TaskATT.projectGlobalId = 435532
12: AND Project_TaskATT.taskGlobalId = TaskOT.globalId
13: AND Project_CostATT.projectGlobalId = 435532
14: AND Project_CostATT.costGlobalId = CostOT.globalId
15: AND OfferOT.objectId = 122

First, any select statement will match a very generic reduction method that will insert
SELECT and FROM clauses into the SQL-DML source graph. A reduction method
on the projection clause (Cost.*) will reduce to a projection of identifiers (globalId,
objectId, and versionId), user-defined attributes and the flag denoting whether the version
is frozen. Note that because the maximal multiplicity of the end causedBy pointing
from Cost to Task is 1, the table CostOT also contains the materialization of a pinned
or latest version of some task, but the column for this materialization is left out in the
projection, because it is irrelevant for the user. Next, a reduction method is invoked on
the DS-DML FROM clause, which itself calls reduction methods on two DS-DML subnodes,
one for each navigation step. Thus, the reduction of Offer-contains->Task
results in conditions in lines 5-6 and the reduction of Task-ratedCosts->Cost results
in conditions in lines 7-8. The reductions carried out in this example rely on two
mechanisms, DS-DDL schema divergence and source-graph divergence.

DS-DDL schema divergence is applied in the following way. The relationship type
used in the first navigation step defines only one floating end while the one used in the
second navigation step defines both ends as floating. Thus, in the reduction of the DS-DDL
schema, we had to map the first relationship type to two distinct tables (because relationships
with only one floating end are not necessarily symmetric). Therefore, the choice
of the table we use (isPartOfF_containsNF) is based on the direction of navigation.
The situation would differ further if the maximal multiplicity defined for the
non-floating end were 1, in which case we would have to use a foreign key column in the
object table. Another important situation where schema divergence is used in our
example product line is operation propagation. To deal with DS-DDL schema divergence,
each reduction method for a given node comes with a set of preconditions
related to the DS-DDL schema that have to be satisfied for method execution.

Source-graph divergence is applied in the following way. In filtered navigation
within a workspace, we have to use the table causedByF_ratedCostsF to arrive at
costs. The obtained versions are further filtered in lines 9, 11, and 13 to arrive only at
costs attached to the workspace with globalId 435532. The situation would be different
outside a workspace, where another table which stores the materialized globalIds of
versions of costs that are either pinned or latest in the corresponding CVC would have
to be used for the join. Thus the reduction of the second navigation step depends on
whether the clause USE WORKSPACE is used. To deal with source-graph divergence,
each reduction method for a given node comes with a set of preconditions related to
node neighborhood in the source graph that have to be satisfied for method execution.

Due to source-graph divergence, line 3 of the DS-DML statement gets reduced to 
lines 9-15 of the SQL-DML statement. 
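
The workspace-related part of this reduction can be sketched as a small function that produces the attachment conditions for every object type on the navigation path. The function name `workspace_conditions` is hypothetical, and the naming convention for attachment tables and their columns is inferred from the SQL-DML statement above rather than taken from the DSL-DA implementation.

```python
def workspace_conditions(workspace: str, workspace_global_id: int,
                         object_types: list[str]) -> list[str]:
    """Build the join conditions that restrict each object type to a workspace."""
    def column(type_name: str) -> str:
        # e.g. "Offer" -> "offerGlobalId" (assumed naming convention)
        return type_name[0].lower() + type_name[1:] + "GlobalId"

    conditions = []
    for object_type in object_types:
        attachment = f"{workspace}_{object_type}ATT"
        conditions.append(f"{attachment}.{column(workspace)} = {workspace_global_id}")
        conditions.append(f"{attachment}.{column(object_type)} = {object_type}OT.globalId")
    return conditions


for line in workspace_conditions("Project", 435532, ["Offer", "Task", "Cost"]):
    print("AND", line)
```

For the example statement this yields six conditions, one pair per object type on the path, which correspond to the attachment conditions in the reduced statement; the final condition on the objectId of the offer comes from the remaining part of line 3.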

Obviously, it is a good choice for the developer to shift decisions due to divergence 
to many “very specialized” reduction methods that can be reused in diverse superordi- 
nated methods and thus abstract from both types of divergence. In this way, the subor- 
dinated methods can be explicitly invoked by the developer using generic calls and the 
driver itself selects the matching method. Four different APIs are available to the devel- 
oper within a reduction method. 

• Source tree traversal. This API is used to explicitly traverse the neighboring nodes 
to make reduction decisions not automatically captured by source-graph polymor- 
phism. The API is automatically generated from the DS-DML metamodel. 

• DS-DDL schema traversal. This API is used to explicitly query the DS-DDL sche- 
ma to make reduction decisions not automatically captured by DS-DDL schema 
polymorphism. The API is automatically generated from the DS-DDL metamodel. 

• SQL-DML API. This API is used to manipulate the SQL-DML source graphs. 

• Reduction API. This API is used for explicit invocation of reduction methods on 
subordinated nodes in the DS-DML source graph. 
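
The selection mechanism described above, in which each reduction method carries preconditions and the driver picks the matching method for a generic call, can be sketched roughly as follows. This is a minimal illustration in Python under stated assumptions, not the driver's actual API: the names ReductionMethod, Driver, and the context keys floating_ends, in_workspace, and table are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical context: facts about the DS-DDL schema and about the node's
# neighborhood in the source graph that preconditions may inspect.
Context = Dict[str, object]


@dataclass
class ReductionMethod:
    node_type: str                                  # kind of DS-DML node handled
    preconditions: List[Callable[[Context], bool]]  # all must hold
    reduce: Callable[[Context], str]                # returns an SQL fragment

    def matches(self, node_type: str, ctx: Context) -> bool:
        return node_type == self.node_type and all(p(ctx) for p in self.preconditions)


class Driver:
    def __init__(self) -> None:
        self.methods: List[ReductionMethod] = []

    def register(self, method: ReductionMethod) -> None:
        self.methods.append(method)

    def reduce(self, node_type: str, ctx: Context) -> str:
        # Generic call: the driver, not the caller, selects the method.
        for method in self.methods:
            if method.matches(node_type, ctx):
                return method.reduce(ctx)
        raise LookupError(f"no reduction method matches {node_type!r}")


driver = Driver()
# Schema divergence: a relationship with one floating end.
driver.register(ReductionMethod(
    "navigation",
    [lambda c: c["floating_ends"] == 1],
    lambda c: f"JOIN {c['table']}",
))
# Source-graph divergence: two floating ends, navigated inside a workspace.
driver.register(ReductionMethod(
    "navigation",
    [lambda c: c["floating_ends"] == 2, lambda c: bool(c["in_workspace"])],
    lambda c: f"JOIN {c['table']} /* filtered by workspace */",
))
```

A superordinated method would then call driver.reduce("navigation", ctx) without knowing which specialized method ends up handling the node.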



7 Conclusion and Future Work 

In this paper, we examined the topic of custom schema development and data manipu- 
lation languages which facilitate increased reuse within database-oriented software 
product lines. Our empirical evaluation, based on an example product line for version- 
ing systems, shows that the portion of time required for mapping domain-specific state- 
ments to SQL at run time is below 9.9%. For this reason, we claim that domain-specific 
languages introduce great benefits in terms of raising the abstraction level in schema de- 
velopment and data queries at practically no cost. 

There is a range of topics we want to focus on in our future work. Is there a way to 
make DS-DMLs even faster? Complex reduction methods can clearly benefit from the 
following ideas. 

• Source graphs typically consist of an unusually large number of objects that have 
to be created at run time. Thus the approach could benefit from instance pools for 
objects to minimize object creation overhead. 

• Caching of SQL-DML source graphs can be applied to reuse them when reducing 
upcoming statements. 

• Would it be possible to use parameterized stored procedures to answer DS-DML 
statements? This makes the reduction of DS-DML statements simpler, because a 
statement can be reduced to a single stored procedure call. On the other hand, it 
makes the reduction of DS-DDL schema more complex, because stored procedures 
capable of answering the queries have to be prepared. We assume this approach is 
especially useful when many SQL-DML statements are needed to execute a DS-DML
statement. Implementing a stored procedure for a sequence of statements
avoids excessive communication (a) between the domain-specific and the native
driver and (b) between the native driver and the database.

• In a number of cases where a sequence of SQL-DML statements is produced as a 
result of reduction, these statements need not necessarily be executed sequentially. 
Thus developers of reduction methods should be given the possibility to explicitly 
mark situations where the driver could take advantage of parallel execution. 

In addition, dealing with DS-DDL schemas raises two important questions. 

• DS-DDL schema evolution. Clearly, supplementary approaches are required to deal 
with modifications in a DS-DDL schema which imply a number of changes in ex- 
isting SQL-DDL constructs. 

• Product-line mining. Many companies develop and market a number of systems 
implemented independently despite their structural and functional similarities, i.e., 
without the proper product-line support. Existing schemas for these systems could 
be mined to extract common domain-specific abstractions and possible reductions, 
which can afterwards be used in future development of new systems.


Abstract. We present an approach to support incremental navigation of structured
information, where the structure is introduced by the data model and schema
(if present) of a data source. Simple browsing through data values and their
connections is an effective way for a user or an automated system to access
and explore information. We use our previously defined Uni-Level Description
(ULD) to represent an information source explicitly by capturing the source's
data model, schema (if present), and data values. We define generic operators
for incremental navigation that use the ULD directly along with techniques for
specifying how a given representation scheme can be navigated. Because our navigation
is based on the ULD, the operations can easily move from data to schema
to data model and back, supporting a wide range of applications for exploring
and integrating data. Further, because the ULD can express a broad range of data
models, our navigation operators are applicable, without modification, across the
corresponding model or schema. In general, we believe that information sources
may usefully support various styles of navigation, depending on the type of user
and the user's desired task.



1 Introduction 

With the WWW at our fingertips, we have grown accustomed to easily using unstruc- 
tured and loosely-structured information of various kinds, from all over the world. With 
a web browser it is very easy to: (1) view information (typically presented in HTML), 
and (2) download information for viewing or manipulating in tools available on our 
desktops (e.g., Word, PowerPoint, or Adobe Acrobat files). In our work, we are focused 
on providing similar access to structured (and semi-structured) information, in which 
data conforms to the structures of a representation scheme or data model. 

There is a large and growing number of structural representation schemes being 
used today including the relational, E-R, object-oriented, XML, RDF, and Topic Map 
models along with special-purpose representations, e.g., for exchanging scientific data. 
Each representation scheme is typically characterized by its choice of constructs for 
representing data and schema, allowing data engineers to select the representation best 
suited for their needs. However, there are few tools that allow data stored in different 
representations to be viewed and accessed in a standard way, with a consistent interface. 

* This work supported in part by NSF grants EIA 9983518 and ITR 0225674. 

P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 668-681, 2004.

The goal of this work is to provide generic access to structured information, much
like a web browser provides generic access to viewable information. We are particularly 
interested in browsing a data source where a user can select an individual item, select a 
path that leads from the item, follow the path to a new item, and so on, incrementally 
through the source. 

The need for incremental navigation is motivated by the following uses. First, we 
believe that simple browsing tools provide people with a powerful and easy way to ac- 
cess data in a structured information source. Second, generic access to heterogeneous 
information sources supports tools that can be broadly used in the process of data in- 
tegration [8, 10]. Once an information source has been identified, its contents can be 
examined (by a person or an agent) to determine if and how it should be combined (or 
integrated) with other sources. 

In this paper, we describe a generic set of incremental-navigation operators that are 
implemented against our Uni-Level Description (ULD) framework [4,6]. We consider 
both a low-level approach for creating detailed and complete specifications as well as 
a simple, high-level approach for defining specifications. The high-level approach ex- 
ploits the rich structural descriptions offered by the ULD to automatically generate 
the corresponding detailed specifications for navigating information sources. Thus, our 
high-level specification language allows a user to easily define and experiment with 
various navigation styles for a given data model or representation scheme. The rest of 
this paper is organized as follows. In Section 2 we describe motivating examples and 
Section 3 briefly presents the Uni-Level Description. In Section 4, we define the incre- 
mental navigation operators and discuss approaches to specifying their implementation. 
Related work is presented in Section 5 and in Section 6 we discuss future work. 



2 Motivating Examples 

When an information agent discovers a new source (e.g., see Figure 1) it may wish 
to know: (1) what data model is used (is it an RDF, XML, Topic Map, or relational 
source?), (2) (assuming RDF) whether any classes are defined for the source (what is 
the source schema?), (3) which properties are defined for a given class (what properties 
does the film class have?), (4) which objects exist for the class (what are the instances 
of the film class?) and (5) what kinds of values exist for a given property of a particular 
object of the class (what actor objects are involved in this film object?). 

This example assumes the agent (or user) understands the data model of the source. 
For example, if the data model used was XML (e.g., see Figure 2) instead of RDF, 
the agent could have started navigation by asking for all of the available element types 
(rather than RDF classes). We call this approach data-model-aware navigation, in which
the constructs of the data model can be used to guide navigation. 

In contrast, we also propose a form of browsing where the user or agent need not 
have any awareness of the data-model structures used in a data source. The user or 
agent is able to navigate through the data and schema directly. As an example (again 
using Figure 1), the user or agent might ask for: (1) the kind of information the source 
contains, which in our example would include “films,” “actors,” and “awards,” etc., 
(2) (assuming the crawler is interested in films) the things that describe films, which
would include “titles” and relationships to awards and actors, (3) the available films
in the source, and (4) the actors of a particular film, which is obtained by stepping
across the “involved” link for the film in question. We call this form of browsing simple
navigation.

Fig. 1. An example of an RDF schema and instance.

XML DTD:

<!ELEMENT moviedb (movie*)>
<!ELEMENT movie (title, studio, genre*, actor*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT studio (#PCDATA)>
<!ELEMENT genre (#PCDATA)>
<!ELEMENT actor (#PCDATA)>
<!ATTLIST actor role CDATA #REQUIRED>

XML Instance:

<?xml version="1.0"?>
<moviedb>
  <movie>
    <title>The Usual Suspects</title>
    <studio>Gramercy</studio>
    <genre>Thriller</genre>
    <actor role="supporting actor">Spacey, Kevin</actor>
  </movie>
</moviedb>

Fig. 2. An example XML DTD (top) and instance document (bottom).


3 The Uni-Level Description

The Uni-Level Description (ULD) is both a meta-data-model (i.e., capable of describing 
data models) and a distinct representation scheme: it can directly represent both schema 
and instance information expressed in terms of data-model constructs. Figure 3 shows 
how the ULD represents information, where a portion of an object-oriented data model 
is described. The ULD is a flat representation in that all information stored in the ULD 
is uniformly accessible (e.g., within a single query) using the logic-based operations 
described in Table 1.

Information stored in the ULD is logically divided into three layers, denoted meta-data-model,
data model, and schema and data instances. The ULD meta-data-model,
shown as the top level in Figure 3, consists of construct types that denote structural 
primitives. The middle level uses the structural primitives to define both data and 
schema constructs, possibly with conformance relationships between them. Constructs 
are necessarily instances of construct types, represented with ct-inst instance-of links. 




Incremental Navigation 






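[Figure 3: three layers. Top: Meta-Data-Model (Construct Types). Middle: Data Model, with
Schema Constructs (e.g., class) and Data Constructs (e.g., object), each linked to a construct
type by ct-inst and related to each other by conf. Bottom: Schema (Instances) (e.g., actor)
and Data (Instances) (e.g., ‘Robert De Niro’), each linked to its construct by c-inst and
related to each other by d-inst.]

Fig. 3. The ULD meta-data-model architecture.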



Table 1. The ULD operations expressed as logical formulas.

Operations for defining instance-of relationships

  ct-inst(c, ct)       Construct c is a ct-inst of construct-type ct.
  c-inst(d, c)         Construct instance d is a c-inst of construct c.
  d-inst(d1, d2)       Construct instance d1 is a d-inst of instance d2.
  conf(d, r, c1, c2)   Instances of construct c1 can conform to instances of construct c2 as
                       required by domain d and range r cardinality constraints (exactly one 1,
                       zero or one ?, zero or many *, or one or many +).

Operations for restricting construct instances

  c1.s->c2             Instances of construct c1 have selectors s with instances of construct c2 as values.
  setof(c1, c2)        Instances of construct c1 are sets whose members are construct c2 instances.
  bagof(c1, c2)        Instances of construct c1 are bags whose members are construct c2 instances.
  listof(c1, c2)       Instances of construct c1 are lists whose members are construct c2 instances.
  unionof(c1, c2)      Instances of construct c2 are also construct c1 instances.

Operations for accessing instance structures

  d1.s:d2              Construct instance d1 has value d2 for selector s.
  d1 ∈ d2              Construct instance d1 is a member of collection d2.
  d1[i]=d2             Construct instance d2 is at the i-th position in the list d1.
  |d1|=n               The length of the collection construct instance d1 is n.
  d1 ∈∈ d2=n           Construct instance d1 is a member of the bag d2 n times.



Similarly, every item introduced in the bottom layer, denoting actual data or schema 
items, is necessarily an instance of a construct in the middle layer, represented with 
c-inst instance-of links. An item in the bottom layer can be an instance of another item 
in the bottom layer, represented with d-inst instance-of links, as allowed by the confor- 
mance relationships specified in the middle layer. For example, in Figure 3, the class 
and object constructs are related through a conformance link, labeled conf, and their 
corresponding construct instances in the bottom layer, i.e., actor and the object with the 
name ‘Robert De Niro’ are related through a data instance-of link, labeled d-inst. 

The ULD offers flexibility through the conf and d-inst relationships. For example, 
an XML element that does not have an associated element type can be represented in 
the ULD; the element would simply not have a d-inst link to any XML element type. 

The ULD represents an information source as a configuration containing the con- 
structs of a data model, the construct instances (both schema and data) of a source, 
and the associated conformance and instance-of relationships. A configuration can be 
viewed as an instantiation of Figure 3. Each configuration uses a finite set of identifiers 
to denote construct types, constructs, and construct instances as well as a finite set of 
ct-inst, c-inst, conf, and d-inst facts. We note that a configuration can be implemented 
as a logical view over an information source, and is not necessarily “materialized.” 
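
As a rough illustration of what a configuration holds, the sketch below stores the four kinds of facts as sets of tuples and checks that every d-inst fact is permitted by some conformance relationship. It is a minimal sketch in Python, not the authors' implementation: the class name Configuration, its method names, and the identifier deniro are hypothetical, the cardinalities in the conf fact are assumed for illustration, and cardinality checking itself is left out.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class Configuration:
    # Each set holds the facts of one ULD relationship.
    ct_inst: Set[Tuple[str, str]] = field(default_factory=set)         # (construct, construct type)
    c_inst: Set[Tuple[str, str]] = field(default_factory=set)          # (instance, construct)
    d_inst: Set[Tuple[str, str]] = field(default_factory=set)          # (instance, instance)
    conf: Set[Tuple[str, str, str, str]] = field(default_factory=set)  # (domain, range, c1, c2)

    def d_inst_allowed(self, d1: str, d2: str) -> bool:
        """True if some conf(_, _, c1, c2) lets d1 (a c1) conform to d2 (a c2)."""
        c1s = {c for (d, c) in self.c_inst if d == d1}
        c2s = {c for (d, c) in self.c_inst if d == d2}
        return any(c1 in c1s and c2 in c2s for (_, _, c1, c2) in self.conf)

    def violations(self) -> Set[Tuple[str, str]]:
        """d-inst facts that no conformance relationship permits."""
        return {(d1, d2) for (d1, d2) in self.d_inst
                if not self.d_inst_allowed(d1, d2)}


# The class/object fragment of Figure 3; cardinalities are assumed.
cfg = Configuration(
    ct_inst={("class", "struct-ct"), ("object", "struct-ct")},
    conf={("*", "*", "object", "class")},
    c_inst={("actor", "class"), ("deniro", "object")},
    d_inst={("deniro", "actor")},
)
```

Because the facts are plain sets, the same structure could equally be populated lazily from a view over the source rather than materialized up front.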

The ULD meta-data-model contains primitive structures for tuples, i.e., sets of 
name-value pairs; set, list, and bag collections; atomics, for scalar values such as strings 
and integers; and unions, for representing non-structural, generalization relationships














Shawn Bowers and Lois Delcambre 



schema constructs:

ct-inst(elemType, struct-ct)
ct-inst(attDefSet, set-ct)
ct-inst(attDef, struct-ct)
ct-inst(contentDef, set-ct)
ct-inst(pcdata, atomic-ct)
ct-inst(cdata, atomic-ct)
elemType.hasName->uldString
elemType.hasAtts->attDefSet
elemType.hasModel->contentDef
setof(attDefSet, attDef)
attDef.hasName->uldString
setof(contentDef, elemType)

data constructs:

ct-inst(element, struct-ct)
ct-inst(attSet, set-ct)
ct-inst(attribute, struct-ct)
ct-inst(content, list-ct)
ct-inst(node, union-ct)
unionof(node, pcdata)
conf(*, ?, element, elemType)
conf(*, ?, attribute, attDef)
unionof(node, element)
element.hasTag->uldString
element.hasAtts->attSet
element.hasChildren->content
setof(attSet, attribute)
attribute.hasName->uldString
attribute.hasVal->cdata
listof(content, node)

Fig. 4. The XML with DTD data model.



among constructs. The construct-type identifiers for these structures are denoted struct- 
ct, set-ct, list-ct, bag-ct, atomic-ct, and union-ct, respectively. 

Figures 4, 5, and 6 give example descriptions of simplified versions of XML with 
DTDs, RDF with RDF Schema, and sample schema and data (from Figure 1) for the
RDF model, respectively¹. We note that there are potentially many ways to describe a
data model in the ULD, and these examples show only one choice of representation. 

The XML data model shown in Figure 4 includes constructs for element types, 
attribute types, elements, attributes, content models, and content, where element types 
contain attribute types and content specifications, elements can optionally conform to 
element types, and attributes can optionally conform to attribute types. We simplify 
content models to sets of element types for which a conforming element must have at 
least one subelement for each corresponding type. 

The RDF data model with RDF Schema (RDFS) of Figure 5 includes constructs for 
classes, properties, resources, and triples. A triple in RDF contains a subject, predicate, 
and object, where a predicate can be an arbitrary resource, including a defined property. 
In RDFS, rdf:type, rdfs:subClassOf, and rdfs:subPropertyOf are considered
special RDF properties for denoting instance and specialization relationships.
However, we model these properties using conformance and explicit constructs. For example,
a subclass relationship is represented by instantiating a subClassOf construct
as opposed to using the special rdfs:subClassOf RDF property².

A ULD query is expressed as a Datalog program [1] and is executed against a con- 
figuration. As an example, the first query below finds all available class names within an 
RDF configuration. Note that upper-case terms denote variables and lower-case terms 
denote constants. The rule is read as “If C is an RDF class and the label of C is X, 
then X is a classname.” The second query returns the property names of all classes in 
an RDF configuration. This query, like the first, is expressed solely against the schema 
of the source. The third query below is expressed directly against data, and returns the 
URI of all RDF resources used as a property in at least one triple, where the resource 
may or may not be associated with schema. 

1 We use uldValue and uldValueType as special constructs to denote scalar values and
value types [4, 6]. Also, uldString and uldURI are default atomic constructs provided by
the ULD.

2 This ULD representation of RDF allows properties and isa relationships to be decoupled
(compared with RDF itself). This approach does not limit the expressibility of RDF: partial,
optional, and multiple levels of schema are still possible.

schema constructs:

ct-inst(resource, union-ct)
ct-inst(rdfType, union-ct)
ct-inst(simpleRes, struct-ct)
ct-inst(class, struct-ct)
ct-inst(prop, struct-ct)
ct-inst(rangeVal, union-ct)
ct-inst(subClass, struct-ct)
ct-inst(subProp, struct-ct)
conf(*, *, simpleRes, class)
conf(*, *, class, class)
conf(*, *, prop, rdfType)
unionof(rangeVal, class)
unionof(rangeVal, uldValueType)
unionof(resource, rdfType)
unionof(resource, simpleRes)
unionof(rdfType, class)
unionof(rdfType, prop)
simpleRes.hasURI->uldURI
class.hasURI->uldURI
class.hasLabel->uldString
prop.hasURI->uldURI
prop.hasLabel->uldString
prop.hasDomain->class
prop.hasRange->rangeVal
subClass.hasSub->class
subClass.hasSuper->class
subProp.hasSub->prop
subProp.hasSuper->prop

data constructs:

ct-inst(triple, struct-ct)
ct-inst(objVal, union-ct)
ct-inst(literal, atomic-ct)
unionof(objVal, resource)
unionof(objVal, literal)
triple.hasPred->resource
triple.hasSubj->resource
triple.hasObj->objVal

Fig. 5. The RDF with RDF Schema data model.



schema:

c-inst(film, class)
c-inst(title, prop)
c-inst(thriller, class)
c-inst(filmthril, subclass)
film.hasURI:‘#film’
film.hasLabel:‘film’
thriller.hasURI:‘#thriller’
thriller.hasLabel:‘thriller’
prop.hasURI:‘#title’
prop.hasLabel:‘#hasTitle’
prop.hasDomain:film
prop.hasRange:‘literal’
filmthril.hasSub:thriller
filmthril.hasSuper:film

data:

c-inst(ml, simpleRes)
c-inst(tl, triple)
d-inst(ml, thriller)
d-inst(ml, film)
ml.hasURI:‘#ml’
tl.hasPred:title
tl.hasSubj:ml
tl.hasObj:‘The Usual Suspects’

Fig. 6. Portion of schema and data for RDF(S).



classname(X) <- c-inst(C, class), C.hasLabel:X.

hasProp(X, Y) <- c-inst(C, class), c-inst(P, prop), P.hasDomain:C, C.hasLabel:X, P.hasLabel:Y.

dataprop(X) <- c-inst(T, triple), T.hasPred:P, P.hasURI:X.
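
To make the reading of these three rules concrete, the sketch below evaluates them over a small in-memory fact base using Python set comprehensions. It is only an illustration and not a Datalog engine: the fact values loosely follow Figure 6, attributing the prop.* selector facts of that figure to the title property is an assumption, and the helper name select is hypothetical.

```python
# c-inst facts: (instance, construct)
c_inst = {
    ("film", "class"), ("thriller", "class"), ("title", "prop"),
    ("filmthril", "subclass"), ("ml", "simpleRes"), ("tl", "triple"),
}

# Selector facts "d1.s:d2" stored as (d1, s, d2).
sel = {
    ("film", "hasLabel", "film"),
    ("thriller", "hasLabel", "thriller"),
    ("title", "hasLabel", "#hasTitle"),
    ("title", "hasURI", "#title"),
    ("title", "hasDomain", "film"),
    ("tl", "hasPred", "title"),
    ("tl", "hasSubj", "ml"),
    ("tl", "hasObj", "The Usual Suspects"),
}


def select(obj, selector):
    """All values v such that obj.selector:v holds."""
    return {v for (o, s, v) in sel if o == obj and s == selector}


# classname(X) <- c-inst(C, class), C.hasLabel:X.
classname = {x for (c, k) in c_inst if k == "class"
             for x in select(c, "hasLabel")}

# hasProp(X, Y) <- c-inst(C, class), c-inst(P, prop), P.hasDomain:C,
#                  C.hasLabel:X, P.hasLabel:Y.
has_prop = {(x, y)
            for (c, k) in c_inst if k == "class"
            for (p, k2) in c_inst if k2 == "prop" and c in select(p, "hasDomain")
            for x in select(c, "hasLabel")
            for y in select(p, "hasLabel")}

# dataprop(X) <- c-inst(T, triple), T.hasPred:P, P.hasURI:X.
dataprop = {x for (t, k) in c_inst if k == "triple"
            for p in select(t, "hasPred")
            for x in select(p, "hasURI")}
```

Each comprehension mirrors one rule body: every `for` clause binds a variable and every `if` clause is a join condition.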

The following three queries are similar to the previous three, but are expressed 
against an XML configuration. The first query finds the names of all available ele- 
ment types in the source, the second finds, for each element-type name, its correspond- 
ing attribute-definition names, and the last finds all available attribute names as a data 
query. 

elemtype(X) <- c-inst(E, elemType), E.hasName:X.

atttype(X, Y) <- c-inst(E, elemType), E.hasName:X, E.hasAtts:AL, AT ∈ AL, AT.hasName:Y.

atts(X) <- c-inst(A, attribute), A.hasName:X.

Finally, the following query returns all constructs that serve as struct-ct schema 
constructs and their component selectors. This query is solely expressed against the 
data-model constructs. 

schemastruct(SC, P) <- ct-inst(SC, struct-ct), conf(DC, SC, X, Y), SC.P->C.



4 Navigation Operators 

The ULD presents a complete, highly detailed description of a data source, with interconnected
model, schema, and data information. In the ULD, each construct type,
construct, and instance is represented by an id and every id, in turn, has an associated
value. The value can be either an atomic value (such as a literal in RDF) or a structured
value (such as a set or bag of ids).

We view navigation as a process of traversing a graph consisting of locations (nodes) 
and links (bi-directional edges), superimposed over a ULD source file. A location is 
either a construct type, construct, or instance in the ULD; thus a location is anything 
with an id. A link is a (simple or compound) path, from one location to another, through 
the connections in the ULD. 

A navigation binding consists of an implementation for the following functions. For
a binding, we assume a finite set of location names $\mathcal{L}$ and a finite set of link names $\mathcal{N}$,
where both $\mathcal{L}$ and $\mathcal{N}$ consist of atomic string values. Navigation consists of moving
from one location name to another. The binding should include only those locations
that are meaningful to the intended user community, with appropriate links.

Starting Points. The operator $sloc : \mathcal{P}(\mathcal{L})$ returns all available entry points into an information
source. We require the result of sloc to be a set of locations (as opposed to links). Note that
$\mathcal{P}(\mathcal{L})$ stands for the power set of $\mathcal{L}$.

Links. The operator $links : \mathcal{L} \to \mathcal{P}(\mathcal{N})$ returns all out-bound links available from a particular
location. For some locations, there may not be any links available, i.e., the links operator
may return the empty set.

Following Links. The operator $follow : \mathcal{L} \times \mathcal{N} \to \mathcal{P}(\mathcal{L})$ returns the set of locations that are at
the end of a given link. We use the follow operator to prepare to move to a new location from
our current location. Given the set of locations returned by the follow operator, the user or
agent directing the navigation can choose one as the new location.

Types. The operator $types : \mathcal{L} \to \mathcal{P}(\mathcal{L})$ returns the (possibly empty) set of types for a given
location. We use the types operator to obtain locations that represent the schema for a data
item. A particular location may have zero, one, or many associated types.

Extents. The operator $extent : \mathcal{L} \to \mathcal{P}(\mathcal{L})$ returns the (possibly empty) set of instances for a
given location. The extent operator computes the inverse of the types operator.

As a simple example, the following (partial) navigation binding can be defined for the
data shown in Figure 6:

sloc = {‘class’, ‘property’, ‘subclass’, ‘resource’, ‘triple’, ...}

extent(‘class’) = {‘film’, ‘comedy’, ‘thriller’, ...}

links(‘thriller’) = {‘title’}

extent(‘thriller’) = {‘#ml’, ...}

follow(‘#ml’, ‘title’) = {‘The Usual Suspects’}
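
The five operators and the example binding can be pictured as a small interface over lookup tables. The following is a minimal Python sketch under stated assumptions, not the implementation described in this paper: the class name DictBinding is hypothetical, the elided entries ("...") of the example are simply left out, the link from ‘#ml’ to ‘title’ is assumed, and extent is derived by inverting the stored types, reflecting the statement that extent is the inverse of types.

```python
from typing import Dict, Set, Tuple


class DictBinding:
    """A navigation binding backed by plain dictionaries (illustrative only)."""

    def __init__(self,
                 starts: Set[str],
                 links: Dict[str, Set[str]],
                 targets: Dict[Tuple[str, str], Set[str]],
                 types: Dict[str, Set[str]]) -> None:
        self._starts = set(starts)
        self._links = links
        self._targets = targets
        self._types = types

    def sloc(self) -> Set[str]:
        return set(self._starts)

    def links(self, loc: str) -> Set[str]:
        # May be empty: some locations have no out-bound links.
        return set(self._links.get(loc, ()))

    def follow(self, loc: str, link: str) -> Set[str]:
        return set(self._targets.get((loc, link), ()))

    def types(self, loc: str) -> Set[str]:
        return set(self._types.get(loc, ()))

    def extent(self, loc: str) -> Set[str]:
        # Inverse of types: every location that has `loc` among its types.
        return {x for x, ts in self._types.items() if loc in ts}


binding = DictBinding(
    starts={"class", "property", "subclass", "resource", "triple"},
    links={"thriller": {"title"}, "#ml": {"title"}},
    targets={("#ml", "title"): {"The Usual Suspects"}},
    types={"film": {"class"}, "comedy": {"class"},
           "thriller": {"class"}, "#ml": {"thriller"}},
)
```

A browsing session would start from binding.sloc(), pick a location, call links on it, and then follow a chosen link to obtain the candidate next locations.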

We express the navigation functions in Datalog using the predicates described be- 
low. An operator binding is a set of ULD queries, where the head of each query is 
a navigation operation (expressed as a predicate). Thus, operator bindings are defined 
as global-as-view mappings from the ULD (typically, over any configuration of a data 
model) to the navigation operations. We propose two ways to specify a navigation bind- 
ing: as a set of low-level ULD queries and as a high-level specification that is used to 
automatically generate the appropriate navigation bindings. 

- sloc(r), where r represents a starting location.

- links(l, k), where k is a link from location l.

- follow(l, k, r), where r is a location that is found by following link k from some location l.

- types(l, r), where l is a location of type r.

- extent(l, r), where r is in the extent of l.

To illustrate, the following low-level binding queries present a view of an RDF con- 
figuration, where we only allow navigation from data items with corresponding schema. 
Thus, this example supports simple browsing. The starting locations are classes and the 
available links from a class are its associated properties and its associated instances. 
The definition uses an additional intensional predicate subClassClosure for computing 
the transitive closure of the RDF subclass relationship. 

sloc(L) <- c-inst(C, class), C.hasLabel:L.

links(L, K) <- c-inst(C, class), C.hasLabel:L, c-inst(P, property),
    P.hasLabel:K, P.hasDomain:C', subClassClosure(C, C').

links(L, K) <- c-inst(C, class), d-inst(O, C), O.hasURI:L, c-inst(T, triple), T.hasPred:X,
    T.hasSubj:O, c-inst(X, property), X.hasLabel:K, X.hasDomain:C',
    subClassClosure(C, C').

follow(L, K, R) <- c-inst(C, class), C.hasLabel:L, c-inst(T, property), T.hasLabel:K,
    T.hasDomain:C', T.hasRange:C'', C''.hasLabel:R, subClassClosure(C, C').

follow(L, K, R) <- c-inst(C, class), C.hasLabel:L, c-inst(T, property), T.hasLabel:K,
    T.hasDomain:C', T.hasRange:R, R=‘literal’, subClassClosure(C, C').

follow(L, K, R) <- c-inst(C, class), d-inst(O, C), O.hasURI:L, c-inst(T, triple), T.hasPred:X,
    T.hasSubj:O, T.hasObj:R, c-inst(R, literal), c-inst(X, property), X.hasLabel:K,
    X.hasDomain:C', subClassClosure(C, C').

follow(L, K, R) <- c-inst(C1, class), d-inst(O1, C1), O1.hasURI:L, c-inst(T, triple), T.hasPred:X,
    T.hasSubj:O1, T.hasObj:O2, O2.hasURI:R, c-inst(X, property), X.hasLabel:K,
    X.hasDomain:C', subClassClosure(C1, C').

extent(L, R) <- c-inst(C, class), C.hasLabel:L, d-inst(O, C), O.hasURI:R.

type(L, R) <- c-inst(C, class), C.hasLabel:R, d-inst(O, C), O.hasURI:L.

subClassClosure(C, C) <- c-inst(C, class).

subClassClosure(C1, C3) <- c-inst(S, subclass), S.hasSub:C1, S.hasSuper:C2,
    subClassClosure(C2, C3).
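
The two subClassClosure rules define a reflexive, transitive closure, which a Datalog engine would compute by iterating to a fixpoint. A minimal Python sketch of that computation follows; the function name and the extra class ‘noir’ in the usage are hypothetical and serve only to show a two-step chain.

```python
from typing import Iterable, Set, Tuple


def sub_class_closure(classes: Iterable[str],
                      subclass_facts: Iterable[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Pairs (c, c') such that c equals c' or c is a direct or indirect subclass of c'.

    subclass_facts holds (sub, super) pairs, one per subclass instance S
    with S.hasSub:sub and S.hasSuper:super.
    """
    facts = list(subclass_facts)
    # First rule: every class is in the closure with itself.
    closure = {(c, c) for c in classes}
    changed = True
    while changed:  # naive fixpoint iteration
        changed = False
        for (c1, c2) in facts:
            # Second rule: hasSub:c1, hasSuper:c2, subClassClosure(c2, c3).
            for (a, c3) in list(closure):
                if a == c2 and (c1, c3) not in closure:
                    closure.add((c1, c3))
                    changed = True
    return closure


closure = sub_class_closure(
    classes={"film", "thriller", "noir"},
    subclass_facts={("thriller", "film"), ("noir", "thriller")},
)
```

In the binding queries above, the closure is what lets a property declared on a superclass such as film appear as a link for a subclass such as thriller.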



In general, with low-level binding queries a user can specify detailed and exact 
descriptions of the navigation operations for data sources. To specify higher-level bind- 
ings, a user selects certain constructs as locations and certain other constructs as links. 
Using this specification, the navigation operators are automatically computed by 
traversing the appropriate instances of locations and links in the configuration. Figure 7 
shows an example of a high-level binding definition for RDF, where RDF classes, re- 
sources, and literals are considered sources for locations and RDF properties and triples 
are considered sources for links (Figure 10 shows a similar binding for XML, which we 
discuss later). 

We define a high-level binding specification as a tuple $(L, N, S, F)$. The disjoint
sets L and N consist of construct identifiers such that L is the set of constructs used as
locations and N is the set of constructs used as links. The set $S \subseteq L$ gives the entry
points of the binding. Finally, the set F contains link definitions (described below).

Each construct in L and N has an associated naming definition that describes how
to compute the name of an instance of the construct. The name would typically be
viewed by the user during navigation. The naming definitions serve to map location and
link instances to appropriate string values. For example, in Figure 7, RDF classes and
properties are named by their labels, resources are named by their URI values, a literal
value is used directly as its name, and the name of a triple is the name of its associated
predicate value.

RDF Binding:

B = (L, N, S, F)

L = class, resource, literal
N = property, triple
S = class

F = triple : resource [triple/hasSubj] => literal [triple/hasObj],
    triple : resource [triple/hasSubj] => resource [triple/hasObj],
    property : class [property/hasDomain] => class [property/hasRange],
    property : class [property/hasDomain] => literal [property/hasRange]

name(X, N) <- c-inst(X, class), X.hasLabel:N.
name(X, N) <- c-inst(X, resource), X.hasURI:N.
name(N, N) <- c-inst(N, literal).
name(X, N) <- c-inst(X, property), X.hasLabel:N.
name(X, N) <- c-inst(X, triple), X.hasPred:R, c-inst(R, property), R.hasLabel:N.

Fig. 7. A high-level binding for simple navigation of RDF.


The incremental operators in a high-level binding specification are computed auto- 
matically by traversing connected instances. We define the following generic rules to 
compute when two instances are connected. (Note that these connected rules only per- 
form single-step traversal, and can be extended to allow an arbitrary number of steps, 
which we discuss at the end of this section.) 

connected(X1, X2) <- c-inst(X1, C), ct-inst(C, struct-ct), X1.S:X2.
connected(X1, X2) <- c-inst(X1, C), ct-inst(C, bag-ct), X2 ∈ X1.
connected(X1, X2) <- c-inst(X1, C), ct-inst(C, list-ct), X2 ∈ X1.
connected(X1, X2) <- c-inst(X1, C), ct-inst(C, set-ct), X2 ∈ X1.
connected(X2, X1) <- connected(X1, X2).
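
The connected rules amount to collecting, for every struct instance, its selector values and, for every collection instance, its members, and then making the relation symmetric. A minimal single-step sketch in Python follows. The function name connected_pairs and the dictionary-based encoding (one construct per instance, struct values as selector-to-value dictionaries) are assumptions made for illustration.

```python
from typing import Any, Dict, Set, Tuple

COLLECTION_TYPES = {"set-ct", "list-ct", "bag-ct"}


def connected_pairs(c_inst: Dict[str, str],
                    ct_inst: Dict[str, str],
                    value: Dict[str, Any]) -> Set[Tuple[str, str]]:
    """Single-step structural connections between instances.

    c_inst maps an instance id to its construct, ct_inst maps a construct
    to its construct type, and value maps an instance id to its value.
    """
    pairs: Set[Tuple[str, str]] = set()
    for x1, construct in c_inst.items():
        kind = ct_inst.get(construct)
        if kind == "struct-ct":
            members = value.get(x1, {}).values()   # X1.S:X2 for any selector S
        elif kind in COLLECTION_TYPES:
            members = value.get(x1, ())            # X2 is a member of X1
        else:
            continue                               # atomics connect to nothing
        for x2 in members:
            pairs.add((x1, x2))
            pairs.add((x2, x1))                    # last rule: symmetry
    return pairs


pairs = connected_pairs(
    c_inst={"tl": "triple", "ml": "simpleRes"},
    ct_inst={"triple": "struct-ct", "simpleRes": "struct-ct", "literal": "atomic-ct"},
    value={
        "tl": {"hasPred": "title", "hasSubj": "ml", "hasObj": "The Usual Suspects"},
        "ml": {"hasURI": "#ml"},
    },
)
```

Note that ml and title are each connected to the triple tl but not to each other, which is why following a link takes two connected steps, one into the link instance and one out of it.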

A connected formula is true when there is a structural connection between two instances.
Note that the rules above do not consider the case when two items are linked
by a d-inst relationship. Instead, the d-inst relationship is directly used by the types
and extent operators, whereas connections are used by the links and follow operators.
The link definitions of F have the form $c_k : c_1[p_1] \Rightarrow c_2[p_2]$ where:

- The behavior of the link construct $c_k \in N$ is being described by the rest of the expression.
For example, in Figure 7, the first link definition is for triple constructs.

- For $c_1, c_2 \in L$, the construct $c_k$ can serve to link an instance of $c_1$ to an instance of $c_2$. Thus,
we can traverse from instances of $c_1$ to instances of $c_2$ via an instance of $c_k$. For example,
in Figure 7, the first link definition says that we can follow a resource instance to a literal
instance if they are connected by a triple.

- The expressions $p_1$ and $p_2$ further restrict how instances of $c_k$ can be used to link $c_1$ and $c_2$
instances, respectively. For example, in Figure 7, the first link definition states that triples
link resources and literals through the triple's hasSubj and hasObj selectors, respectively.

We define the linkSource(f, i1, ik) and linkTarget(f, ik, i2) clauses as follows.
Given a link definition f ∈ F and a connection from i1 to ik such that connected(i1, ik)
is true (where ik is the link instance), linkSource(f, i1, ik) is true if f = ck : c1[p1] =>
c2[p2] such that i1 is an instance of c1 (that is, c-inst(i1, c1) is true), ik is an instance of
ck (that is, c-inst(ik, ck) is true where ck is a link construct), and i1 and ik are connected
according to the expression p1. Similarly, for an f ∈ F, linkTarget(f, ik, i2) is true if
f = ck : c1[p1] => c2[p2] such that ik is an instance of ck (a link construct), i2 is an
instance of c2, and ik and i2 are connected according to the expression p2.

sloc(R) ← BS(S), c-inst(X, S), name(X, R).
links(L, K) ← name(X, L), connected(X, Y), BF(F), linkSource(F, X, Y), name(Y, K).
follow(L, K, R) ← name(X, L), connected(X, Y), BF(F), linkSource(F, X, Y), name(Y, K),
                  connected(Y, Z), linkTarget(F, Y, Z), name(Z, R).
type(L, R) ← name(X, L), c-inst(X, P), BL(P), d-inst(X, Y), c-inst(Y, Q), BL(Q),
             name(Y, R).
extent(L, R) ← name(Y, L), c-inst(Y, Q), BL(Q), d-inst(X, Y), c-inst(X, P), BL(P),
               name(X, R).

Fig. 8. Datalog rules to compute links and locations.

Given the above definitions, we automatically compute each navigation operator
using the Datalog queries in Figure 8. We assume each operator is represented as an
intensional predicate (as before) and the binding specification B = (L, N, S, F) is
stored as a set of unary extensional predicates BL, BN, BS, and BF. For example, the
expression BL(X) binds X to a location in L for binding B. We also assume that the
name predicate is stored as an intensional formula (as defined in the binding).

The first rule in Figure 8 finds the set of entry points: It obtains a construct in 
the set of starting locations, finds an instance of the construct, and then computes the 
name of the instance. The second rule finds the locations with links. For each named 
instance in the configuration that is connected to another instance, we use the linkSource 
predicate to check if it is a valid connection, we check to make sure that the link instance 
(represented as the variable Y) is valid, and then compute the name of the instance. The 
third rule is similar to the second, except it additionally uses the linkTarget to determine 
the new location. Finally, the last two rules use the d-inst relationship to find types and 
extents, respectively. 
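
The rules of Figure 8 can be read as joins over the stored facts. The following Python sketch, which is not the authors' implementation, evaluates sloc, links, follow, and extent over a small hand-built configuration. The instance identifiers (st, lit, and so on), the dictionary layout, the contents of the binding, and the restriction of each path expression to a single selector of the link instance are all assumptions made for illustration; only the struct-ct case of connected is covered.

```python
# Illustrative sketch only: identifiers, data layout, and both link
# definitions are assumptions, not taken from the binding in Figure 7.

# c-inst(X, C): each instance and the construct it instantiates.
C_INST = {
    "thriller": "class", "film": "class", "st": "subclass",
    "title": "property", "t1": "triple", "m1": "resource", "lit": "literal",
}
# X.S:Y facts: selector S of structure X holds instance Y.
SELECT = {
    ("st", "hasSub"): "thriller", ("st", "hasSuper"): "film",
    ("title", "hasDomain"): "film",
    ("t1", "hasSubj"): "m1", ("t1", "hasPred"): "title", ("t1", "hasObj"): "lit",
}
D_INST = {("m1", "film")}  # d-inst(X, Y): X is a data instance of Y.

# Binding B = (L, N, S, F). Each link definition is (ck, c1, p1, c2, p2),
# with p1 and p2 simplified to one selector of the link instance.
B_L = {"class", "resource", "literal"}
B_N = {"triple", "property"}
B_S = {"class"}
B_F = [
    ("triple", "resource", "hasSubj", "literal", "hasObj"),
    ("property", "class", "hasDomain", "class", "hasRange"),
]
assert all(f[0] in B_N for f in B_F)  # link constructs must come from N

# name(X, N), written out by hand for this configuration.
NAME = {
    "thriller": "thriller", "film": "film", "title": "title",
    "t1": "title", "m1": "#m1", "lit": "The Usual Suspects",
}


def connected(x, y):
    """Single-step, symmetric structural connection (struct-ct case only)."""
    return any((a == x and v == y) or (a == y and v == x)
               for (a, _sel), v in SELECT.items())


def named(label):
    """All instances X such that name(X, label) holds."""
    return [x for x, n in NAME.items() if n == label]


def link_source(f, i1, ik):
    ck, c1, p1, _c2, _p2 = f
    return C_INST[ik] == ck and C_INST[i1] == c1 and SELECT.get((ik, p1)) == i1


def link_target(f, ik, i2):
    ck, _c1, _p1, c2, p2 = f
    return C_INST[ik] == ck and C_INST[i2] == c2 and SELECT.get((ik, p2)) == i2


def sloc():
    return {NAME[x] for x, c in C_INST.items() if c in B_S}


def links(label):
    return {NAME[y]
            for x in named(label)
            for y in C_INST
            if connected(x, y)
            for f in B_F
            if link_source(f, x, y)}


def follow(label, link):
    return {NAME[z]
            for x in named(label)
            for y in C_INST
            if connected(x, y) and NAME.get(y) == link
            for f in B_F
            if link_source(f, x, y)
            for z in C_INST
            if connected(y, z) and link_target(f, y, z)}


def extent(label):
    return {NAME[x]
            for y in named(label) if C_INST[y] in B_L
            for (x, target) in D_INST
            if target == y and C_INST[x] in B_L}
```

Under these assumptions, sloc() yields {'film', 'thriller'}, links('film') yields {'title'}, extent('film') yields {'#m1'}, and follow('#m1', 'title') yields {'The Usual Suspects'}, while links('thriller') is empty because the only instance connected to thriller is the subclass structure, which is not a link construct.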

To demonstrate the approach, we use the binding definition of Figure 7 and the 
sample configuration of Figure 9. This configuration shows part of Figure 1 as a graph 
whose nodes are construct instances and edges are either connections between struc- 
tures or d-inst links. Consider the following series of invocations. 

1. sloc = {‘film’, ‘thriller’}. According to the binding definition, the sloc operator returns all the
labels of class construct instances. As shown in Figure 9, the only class construct instances
are thriller and film.

2. links(‘film’) = {‘title’}. The links operator is computed by considering each connected
construct instance of film until it finds a construct instance whose associated construct is in N.
As shown, the only such instance is title, which is an RDF property.

3. extent(‘film’) = {‘#m1’}. The extent operator looks for the d-inst links of the given instance.
As shown, the only such link for film is to m1.

4. follow(‘#m1’, ‘title’) = {‘The Usual Suspects’}. The follow operator starts in the same way
as the links operator by finding instances (whose constructs are in N) that are connected to
the given instance. For the given item, the only such instance in Figure 9 is t1. The follow
operator then returns the hasObj component of t1, according to the link definition for triple.

[Figure 9: a graph whose nodes are the construct instances thriller (class), film (class),
film-thriller (subclass), title (property), t1 (triple), m1 (resource), ‘#m1’ (URI),
‘The Usual Suspects’ (literal), and ‘literal’; edges are labeled hasSub, hasSuper, hasDomain,
hasRange, hasSubj, hasPred, hasObj, hasURI, or d-inst. The legend distinguishes location,
path, and unbound nodes.]

Fig. 9. An example of labeled instances for RDF.



Note that for this example, the rules for computing the follow and links operators 
only consider instances that are directly connected to each other (through the connected 
predicate). Thus, invoking the operator links( ‘thriller’) will not return a result (for our 
example) because there are no link construct instances directly connected to the associ- 
ated RDF class. However, the properties of a class in RDF also include the properties of 
its superclasses. We can include such information by expanding the set of connection 
rules. One approach is to allow the navigation specifier to add a connected rule specif- 
ically for the subclass case. Alternatively, we can extend the connection definition to 
compute the transitive closure (of connections) using the following rule. 

connected(X1, X3) ← connected(X1, X2), connected(X2, X3).
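
As a rough illustration of what the transitive rule adds, the Python sketch below computes the closure by naive fixpoint iteration over a set of pairs. The instance identifiers (st for the subclass structure, and so on) and the choice of a naive fixpoint are assumptions made for the example; a Datalog engine would typically use a more efficient evaluation strategy.

```python
def transitive_closure(edges):
    """Least set of pairs containing `edges` and closed under
    connected(X1, X3) <- connected(X1, X2), connected(X2, X3)."""
    closure = set(edges)
    while True:
        derived = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if derived <= closure:  # nothing new was derived: fixpoint reached
            return closure
        closure |= derived


# Hypothetical single-step connections: the subclass structure st links
# thriller and film, and the title property has film as its domain.
base = {("st", "thriller"), ("st", "film"), ("title", "film")}
# The single-step rules are symmetric, so add every reversed pair.
single_step = base | {(b, a) for (a, b) in base}

closed = transitive_closure(single_step)
```

In this sketch the pair ("thriller", "title") is absent from single_step but present in closed, which is what lets a links invocation on thriller reach the title property. Because the underlying relation is symmetric, the closure also contains pairs such as ("film", "film").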

We also allow binding specifications to include multiple-step path expressions in F. 
For example, we add the following link definitions to the RDF binding specification to 
correctly support subclasses. 

property : class [ property/hasDomain/hasSuper/hasSub ] => class [ property/hasRange ]
property : class [ property/hasDomain/hasSuper/hasSub ] => literal [ property/hasRange ]

Finally, Figure 10 shows a high-level binding specification for the XML data model 
of Figure 4. The binding specification assumes the transitive connection relation de- 
fined above. Element types, elements, and atomic data serve as locations, with element 
types as starting locations. Attribute definitions, attributes, content definitions, and con- 
tent serve as links. We use ‘hasChildType’ and ‘hasChild’ strings as the names of the 
links for content definitions and element content, respectively. Note the attDef link
definition is a special case in which attDef links always lead to an empty set of locations
(denoted using the empty set in Figure 10). Also, the ending ‘/’ in a link-definition path
denotes traversal into the elements of a collection structure (as opposed to denoting the
collection structure itself).

5 Related Work 

A number of approaches provide browsing capability for traditional databases. Motro
[14] seeks to enable users who are (1) not familiar with the data model of the system, (2)
not familiar with the organization of the database (i.e., the schema), (3) not proficient
with the use of the system (i.e., the query language), (4) not sure what data they are
looking for (but are looking for something interesting or suitable), and/or (5) not clear
how to construct the desired query. As more structured information finds its way onto the
Web, we believe these issues become more pressing for users as well as for software
agents wishing to exploit structured information.

XML Binding:

B = (L, N, S, F)

L = elemType, element, cdata, pcdata
N = attDef, attribute, contentDef, content
S = elemType

F = attDef : elemType [ elemType/hasAtts/ ] => {},
    attribute : element [ element/hasAtts/ ] => cdata [ attribute/hasVal ],
    contentDef : elemType [ elemType/hasModel ] => elemType [ contentDef/ ],
    content : element [ element/hasChildren ] => element [ content/ ],
    content : element [ element/hasChildren ] => pcdata [ content/ ]

name(X, N) ← c-inst(X, elemType), X.hasName:N.
name(X, N) ← c-inst(X, element).
name(X, X) ← c-inst(X, cdata).
name(X, X) ← c-inst(X, pcdata).
name(X, N) ← c-inst(X, attDef), X.hasName:N.
name(X, N) ← c-inst(X, attribute), X.hasName:N.
name(X, N) ← c-inst(X, contentDef), N=‘hasChildType’.
name(X, N) ← c-inst(X, content), N=‘hasChild’.

Fig. 10. A high-level binding for direct navigation of XML.

Database browsing typically assumes a fixed data model [2, 14, 12, 17, 3, 8, 18] (either
relational, E-R, or object-oriented). Only a few systems allow browsing schema and
data in isolation [3, 12, 14], whereas most support browsing data only through schema
(i.e., navigating data using items of the schema). Hypertext systems, including those
with structured data models, also use browser-based interfaces [15, 13]. These systems,
as in database approaches, are developed for a single data model, and support limited
browsing styles.

The links and locations abstraction used by incremental navigation is similar in 
spirit to the graph-based model of RDF and RDF Schema. The Object Exchange Model 
(OEM) - the semi-structured representation of TSIMMIS [11, 16] - is another simple
abstraction. Both TSIMMIS and some database browsing systems [2, 9, 14, 8] support
user navigation mixed with user queries, which we would like to explore as an extension 
to our current navigation operators. Finally, Clio [18] provides some support for navi- 
gation, specifically to help users build data-transformation queries. Clio supports “data 
walks,” which display example data involved in each potential join path between two 
relations, and “data chases,” which display all occurrences of a specific value within a 
database. 



6 Conclusion and Future Work 

We believe incremental navigation provides a simple, generic abstraction, consisting of
links and locations (and types and extents when applicable), that can be applied over
arbitrary data models. More than that, with the high-level binding approach, it becomes
relatively straightforward to specify the links and locations for a data model, thus
enabling generic and uniform access to information represented in any underlying data
model (described in the ULD). We believe that this approach can be extended beyond
navigation to include querying information sources (i.e., querying links and locations)
and for specifying high-level mappings between data sources.

We have implemented a prototype browser [5] to demonstrate incremental naviga- 
tion, both for data-model aware and simple navigation of RDF, XML, Topic Map, and 
relational sources. In addition, some of the ideas of incremental navigation appear in the 
Superimposed Schematics browser [7], which allows users to incrementally navigate an 
ER schema and data source. Based on these experiments, we believe incremental navi- 
gation is viable, and helps reduce the work required to develop such browsing tools. 

For future work, we intend to investigate whether additional ULD information can 
be used, such as data-model constraints, to help validate and generate operator bind- 
ings. We are also interested in defining a language to express path-based queries over 
the links and locations abstraction offered by incremental navigation. One issue is to 
determine whether algorithms and optimizations can be defined to efficiently compute 
(i.e., unfold) the binding-specification rules to answer such path queries. Finally, we 
believe that the incremental-navigation operators can be easily expressed as a standard 
web-service interface (where information sources have corresponding web-service
implementations), providing generic, web-based access to heterogeneous information.

Abstract. The realization of complex distributed applications, required in areas
such as e-Business, e-Government, and ambient intelligence, calls for new development
paradigms, such as the Service Oriented Computing approach, which accommodates
dynamic and adaptive interaction schemata carried out at a peer-to-peer level.
Multi Agent Systems offer natural architectural solutions to several requirements
imposed by such an adaptive approach.

This work discusses the limitations of common agent patterns, typically adopted in
distributed information systems design, when applied to service oriented computing,
and introduces two novel agent patterns, which we call the Service Oriented Organization
and the Implicit Organization Broker agent pattern, respectively. Some design
aspects of the Implicit Organization Broker agent pattern are also presented.
The limitations and the proposed solutions are demonstrated in the development
of a multi agent system which implements a pervasive museum visitors guide.
Some of the architecture and design features serve as a reference scenario for the
demonstration of both the current methods' limitations and the contribution of the
newly proposed agent patterns and associated communication framework.



1 Introduction 

Complex distributed applications emerging in areas such as e-Business, e-Government,
and so-called ambient intelligence (i.e., “intelligent” pervasive computing [7])
need to adopt forms of group communication that are deeply different from classical
client-server and Web-based models (see, for instance, [13]). This strongly motivates
forms of application-level peer-to-peer interaction, clearly distinct from the
request/response style commonly used to access distributed services such as, e.g., Web
Services adopting SOAP, XML, and RPC as communication protocols [6, 12]. So-called
service oriented computing (SOC) is the paradigm that accommodates the
above-mentioned more dynamic and adaptive interaction schemata.

Service-oriented computing is applicable to ambient intelligence as a way to access
environmental services, e.g., accessing sensors or actuators close to a user. Multi Agent
Systems (MAS) naturally accommodate the SOC paradigm. In fact, each service
can be seen as an autonomous agent (or an aggregation of autonomous agents), possibly
without global visibility and control over the global system, and characterized by
unpredictable, intermittent connections with other agents of the system. However, we
argue that some domain specificities - such as the necessity to continuously monitor the
environment for understanding the context and adapting to the user needs, and the speed
at which clients and service providers come and go within a physical environment
populated with mobile devices - impose new challenging system architecture requirements
that are not satisfied by traditional agent patterns proposed for request/response
interactions. Moreover, in ambient intelligence applications, we often need to effectively
deal with service composition based on dynamic agreements among autonomous peers,
because groups of peers collaborate at different levels and times during the service
providing process life-cycle, analogously to the product life-cycle process introduced by
Virtual Enterprise scenarios [14]. These group communication styles should be used as
architectural alternatives or extensions to middle agents (e.g., matchmakers and brokers),
simplifying the application logic and moving context-specific decision-making
from high-level applications or intermediate agents down to the agents called to achieve
a goal. For these reasons, in this paper, we propose novel agent patterns, which allow for
dynamic, collective, and collaborative reconfiguration of service providing schemata.

To illustrate our approach, we use the notion of service oriented organization. We
call service oriented organization (SOO) a set of autonomous software agents that, in a
given location at a given time, coordinate in order to provide a service; in other words,
a SOO is a team of agents whose goal is to deliver a service to its clients. Examples
of SOO are not restricted to Web Services and ambient intelligence; for instance, they
include virtual enterprises or organizations [14, 8], the name of which reflects the
application area in which they have been adopted, i.e., e-Business. In addition, this paper
focuses on a special type of SOO that we call implicit organization broker (IOB), since
it exploits a form of group communication called channelled multicast [3] to avoid
explicit team formation and to dynamically agree on the service composition. We will
compare SOO and IOB to traditional agent patterns based on brokers or matchmakers.
As a reference example to illustrate our ideas, we adopt an application scenario from
Peach [15], an ongoing project for the development of an interactive museum guide.
Using the Peach system, users can request information about exhibits; these may be
provided by a variety of information sources and media types (museum server, online
remote servers, video, etc.). We also adopt the Tropos software design methodology
[5, 2] to illustrate and compare the different agent patterns.

Tropos adopts high-level requirements engineering concepts founded on notions
such as actor, agent, role, position, goal, softgoal, task, resource, belief and different
kinds of social dependency between actors [5, 2, 11]. Therefore, Tropos allows for
a modeling level more abstract than other current methodologies such as UML and
AUML [1]. Such properties fit well with our major interest, which is in modeling
environmental constraints that affect and characterize agents’ roles and their intentional and
social relationships, rather than in implementation and/or technological issues.

Section 2 briefly recalls some background notions on Tropos, on service oriented 
computing, and on agent patterns. Section 3 describes and discusses an excerpt of the 
Peach project, adopted as a reference case to illustrate our arguments. Section 5.1 tries 
to overcome some limitations of traditional patterns by proposing two new agent pat- 
terns: the Service Oriented Organization and the Implicit Organization Broker. Sec- 
tion 5.2 aims at justifying group communication as fundamental to effectively deal with 
the proposed patterns, providing a rationale view and describing dynamic aspects. Some
conclusions are given in Section 6.

2 Background

Tropos. The Tropos methodology [2, 5] adopts ideas from Multi Agent Systems
technologies and concepts from requirements engineering through the i* framework. i* is
an organizational modeling framework for early requirements analysis [18], founded on
notions such as actor, agent, role, goal, softgoal, task, resource, and different kinds of
social dependency between actors. Actors represent any active entity, either individual
or collective, and either human or artificial. Thus, an actor may represent a person
or a social group (e.g., an enterprise or a department) or an artificial system, such as
an interactive museum guide or each of its components (both hardware and software)
at different levels of granularity. Actors may be further specialized as roles or agents.
An agent represents a physical (human, hardware or software) instance of an actor that
performs the assigned activities. A role, instead, represents a specific function that, in
different circumstances, may be executed by different agents - we say that the agent
plays the role. Actors (agents and roles) are used in Tropos to describe different social
dependency and interaction models. In particular, Actor Diagrams (see Figures 1, 3,
and 4) describe the network of social dependencies among actors. An Actor Diagram is
a graph, where each node may represent either an actor, a goal, a softgoal, a task or a
resource. Links between nodes may be used to form paths like: depender → dependum
→ dependee, where the depender and the dependee are actors, and the dependum is
either a goal, a softgoal, a task or a resource. Each path between two actors indicates that
one actor depends on the other for something (represented by the dependum) so that
the former may attain some goal/softgoal/task/resource. In other terms, a dependency
describes a sort of “agreement” between two actors (the depender and the dependee),
in order to attain the dependum. The depender is the depending actor, and the dependee
the actor who is depended upon. The type of the dependum describes the nature of the
dependency. Goal dependencies are used to represent delegation of responsibility for
fulfilling a goal; softgoal dependencies are similar to goal dependencies, but their
fulfillment cannot be defined precisely (for instance, the appreciation is subjective, or the
fulfillment can occur only to a given extent); task dependencies are used in situations
where the dependee is required to perform a given activity; and resource dependencies
require the dependee to provide a resource to the depender. As exemplified in Figure 1,
actors are represented as circles¹; dependums - goals, softgoals, tasks and resources -
are represented as ovals, clouds, hexagons and rectangles, respectively. Goals and
softgoals introduced with Actor Diagrams can be further detailed and analyzed by means of
the so-called Goal Diagrams [2], in which the rationale of each (soft)goal is described
in terms of goal decompositions, means-end analysis and the like, as, e.g., in Figure 5.
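
One way to picture the depender → dependum → dependee paths of an Actor Diagram is as plain records. The Python sketch below is only an illustration: the class and function names are hypothetical, and although the two dependencies are the ones the paper attributes to the matchmaker pattern of Figure 1a, typing both dependums as goals is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class DependumType(Enum):
    """The four kinds of dependum; the kind determines the nature of the dependency."""
    GOAL = "goal"
    SOFTGOAL = "softgoal"
    TASK = "task"
    RESOURCE = "resource"


@dataclass(frozen=True)
class Dependency:
    """One depender -> dependum -> dependee path of an Actor Diagram."""
    depender: str
    dependum: str
    kind: DependumType
    dependee: str


def dependees_of(actor, dependencies):
    """Actors that `actor` directly depends on."""
    return {d.dependee for d in dependencies if d.depender == actor}


# Dependencies of the matchmaker pattern; the GOAL typing is assumed.
diagram = [
    Dependency("Consumer", "locate good provider", DependumType.GOAL, "Matchmaker"),
    Dependency("Matchmaker", "advertise service", DependumType.GOAL, "Provider"),
]
```

With this representation, dependees_of("Consumer", diagram) contains only Matchmaker, reflecting that the consumer has no direct dependency on the provider in the diagram itself.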

Tropos spans four phases of Requirements Engineering and Software Engineering
activities [5, 2]: Early Requirements Analysis, Late Requirements Analysis, Architectural
Design, and Detailed Design. Its key premise is that agents and goals can be used
as fundamental concepts for all the phases of the software development life cycle. Actor
and Goal Diagrams are adopted from Early Requirements Analysis to Architectural
Design. Here, we use them to describe the agent patterns we are interested in.

¹ We do not adopt any graphical distinction between agents and roles: when needed, we clarify
it in the text.

Service oriented computing. Service Oriented Computing (SOC) [12] provides a general,
unifying paradigm for diverse computing environments such as grids, peer-to-peer
networks, ubiquitous and pervasive computing. A service encapsulates a component
made available on a network by a provider. The interaction between a client and a service
normally follows a straightforward request/response style, possibly asynchronous;
this is the case with Web Services, which adopt SOAP, XML, and RPC [6, 12] as
communication protocols. Two or more services can be aggregated to offer a single, more
complex, service or even a complete business process; the process of aggregation is
called service composition.

As already noticed, MAS naturally accommodate the SOC paradigm. Since each
agent in a MAS may be either an arbiter or an intermediary for the user’s requested
service, two common agent patterns that appear to be appropriate are the matchmaker
and the broker (see, e.g., [10, 16]).



Agent patterns for SOC. To accommodate the different settings and agents that can be
involved, and the different roles that - from time to time - can be played by each
agent, a pattern-based approach to the description and design of MAS architectures
for SOC systems can be adopted. An agent pattern can be used to describe a problem
commonly found in MAS design and to prescribe a flexible solution for that problem,
so as to ease the reuse of that solution [11, 17, 9]. The literature on Tropos adopts ideas
from social patterns [5, 11] to focus on social and intentional aspects that are recurrent
in multi-agent or cooperative systems. Here, we adopt Actor and Goal Diagrams to
characterize MAS design patterns, focusing on how the goals assigned to each agent²
are fulfilled [2, 11], rather than on how agents communicate with each other. In the
spirit of Tropos, which stresses the importance of analyzing each problem at a high
abstraction level so that the complexity of the system components can be reduced and
easily managed at ‘design time’, we aim at enhancing the reuse of design experience
and knowledge by means of the adoption of agent patterns.

In our context, such patterns have to cope with the important issue of locating
information/service providers, which is an architectural requirement. Indeed, as also
investigated in [13], such a requirement strongly affects coordination issues in decentralized
(pure) peer-to-peer scenarios. Thus, to support the peer-to-peer scenario, the matchmaker
agent pattern (see Figure 1a) plays a key, central role in providing the whole
system with searching and matching capabilities; see, e.g., [16].

At the same time, the focus on the service providing process life-cycle puts the
consumer in the center, and when the consumer demands novel services the system
architecture should provide them without overwhelming her with additional interactions.
Moreover, in a decentralized scenario, several local failures may happen when trying
to locate new services; hence, a huge number of interactions may be needed before
reaching the related provider. Of course, reducing the interaction complexity decreases
the customer overload. Such a requirement calls for a broker pattern
too, as detailed in Figure 1b (e.g., see [10]).



² Indeed, in accordance with the Tropos terminology, we should speak about roles, but we drop
this distinction here, to ease the reading of the diagrams.

Fig. 1. a) Matchmaker agent pattern; b) Broker agent pattern, depicted by means of the Tropos
Actor Diagrams.



The Tropos diagram of Figure 1a shows that each time a user’s information/service
request arrives³, Consumer depends on Matchmaker to locate good provider. In
contrast, Figure 1b shows that Consumer depends on Broker in order to forward
requested service, that is, Broker plays an intermediary role between Provider and
Consumer. In essence, both Broker and Matchmaker depend on Provider to advertise
service(s). Namely, the skills of the two patterns consist of mediating between consumers
and providers, so that synergic collaborations can satisfy global goals. In particular,
Matchmaker lets Consumer directly interact with Provider, while Broker handles all the
interactions between Consumer and Provider.
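
The difference between the two middle agents can be sketched in a few lines of Python. This is a hypothetical illustration rather than code from the Peach system: the class names, the registry keyed by service name, and the first-match selection policy are all assumptions. Both agents rely on providers advertising their services; they differ only in who talks to the provider afterwards.

```python
class Provider:
    def __init__(self, name, service):
        self.name = name
        self.service = service

    def serve(self, request):
        return f"{self.name} handled {request}"


class MiddleAgent:
    """Registry shared by both patterns: providers advertise their services here."""

    def __init__(self):
        self._registry = {}

    def advertise(self, provider):
        self._registry.setdefault(provider.service, []).append(provider)

    def _locate(self, service):
        providers = self._registry.get(service, [])
        if not providers:
            raise LookupError(f"no provider advertises {service!r}")
        return providers[0]  # assumed policy: first advertised provider wins


class Matchmaker(MiddleAgent):
    def locate_provider(self, service):
        # Hands the provider back; the consumer then interacts with it directly.
        return self._locate(service)


class Broker(MiddleAgent):
    def forward(self, service, request):
        # Mediates the whole interaction; the consumer never sees the provider.
        return self._locate(service).serve(request)


composer = Provider("VideoComposer", "presentation")
matchmaker, broker = Matchmaker(), Broker()
matchmaker.advertise(composer)
broker.advertise(composer)

direct = matchmaker.locate_provider("presentation").serve("exhibit 12")
mediated = broker.forward("presentation", "exhibit 12")
```

Both calls end with the same provider doing the work; what changes is whether the consumer holds a reference to the provider (matchmaker) or only to the middle agent (broker). In either case a provider that never advertised cannot be found, which is the centralization issue discussed for these patterns.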



3 A Reference Scenario 

The Peach project [15] focuses on the development of a mobile museum visiting guide 
system. The whole system is a MAS, which has been developed following the Tropos 
methodology. Indeed, agents perform their actions while situated in a particular envi- 
ronment that they can sense and affect. More specifically, in the typical Peach museum 
visiting guide scenario, a user (the visitor) is provided with several environmental in- 
teraction devices. The most evident to her is a personal hand-held mobile I/O device, 
namely a PDA. Other devices include: i) passive localization hot-spots, based on
triangulation of signals coming from the PDA; ii) (pro)active stationary displays of
different sizes and with different audio output quality. Depending on the dimensions, 
the displays may be used to deliver visual/audio information (images and/or motion 
pictures possibly with audio comments; text) to a single user at a time, or to a group of 
users. 

Given this environment, let us start from the following possible user-system inter- 
action scenario: 

Example 1 (explicit communication). A museum visitor requests some information
during her tour by using her mobile device. To deliver on such a goal, the PDA
contains an agent (the User Assistant) which, on behalf of the user, sends a presentation
request to the museum central system. Here, three system actors take the responsibility
of generating a presentation: the Presentation Composer, the User Modeler and the
Information Mediator.

³ In the context of our simplified Peach example (see below), the Consumer is the role played by
the software agent acting as the interface for the human user, that is the User Assistant.

Fig. 2. Overview of the actor interactions.

Still using Tropos, we can get to detailed design and model the communication 
dimension of the system actors. To this end, Tropos adopts AUML [1] interaction
diagrams. The communication diagram of Figure 2 presents the sequence of events from
the time a request for presentation is issued until the presentation is presented to the 
user. The User Assistant, the Presentation Composer, and the User Modeler are generic 
roles that may be played by different software agents, e.g., there may be several differ- 
ent information mediators (for video, audio, text, pictures, animation, local and remote 
information sources and more), there may be several user assistants with different ca- 
pabilities (hand-held devices, desk-top stations, wall mounted large plasma screens and 
more), and there may also be several different user modelers implementing various 
techniques to get users’ profiles. 

In any case, here, we are not interested in the specific agents implementing such 
functionalities (i.e., playing the assigned role), but, instead, in the roles themselves. In 
fact, they - i.e., the User Assistants, the Presentation Composer, and the Information
Mediator - form ad-hoc service-oriented organizations, in order to achieve the service 
goal. Each SOO is characterized by members that collaborate at different levels and 
times during the service providing process life-cycle. After the goal is satisfied, the 
organization is dissolved and a new one will be formed - possibly including different 
agents, provided they play the listed roles - to serve a new request. 

3.1 Discussion 

The previous section motivates the need for some agent patterns to effectively deal with
distributed computing issues (e.g., see [11, 17, 16, 10]).

Nevertheless, if we proceed by adopting traditional agent patterns, such as the
matchmaker and broker introduced in Section 2, we would probably be unable to capture
a few interesting and vital architectural requirements that arise from our ambient
intelligence scenario, especially if we want to fully exploit the flexibility - in terms of
self-organizing presentation delivery channels - that can be provided.

In particular, to motivate our assertion, let us consider the following new scenario:

Example 2 (implicit communication). Let us assume that, while walking around, the
user is approaching some presentation devices that are more comfortable and suitable
to handle the presentation than the mobile device (User Assistant), e.g., in terms of pixel
resolution and audio quality. So, we may assume the User Assistant is autonomously
capable of exploiting its intelligent behavior by negotiating the most convenient
presentation, on behalf of its human owner. Let us also assume that there are several different
Presentation Composers for each single device (capable of generating video, text,
animated explanation, audio, etc.) and that each Presentation Composer relies on different
Information Mediators to provide the information required for presentation generation.
Moreover, we may also assume that each Presentation Composer is able to proactively
propose its best services (in terms of available or conveniently producible presentations)
to the User Assistant, possibly through some mediation interface. We also expect
that all the services (negotiated or proposed) are “dynamically validated”, that is, because
the environment and the user location are quickly changing, only the
appropriate services are considered.

Such a scenario calls for architectural flexibility in terms of dynamic group
reconfiguration to support the involvement of SOOs. Traditional approaches allow for
intentional relationships and request/response communication protocols among single
agents only, and not among groups of agents [9-11, 17]. More specifically, we may assume
that the User Assistant starts an interaction session that triggers the involvement of a
group of system actors, all with the ability of Presentation Composer, which in turn
trigger the involvement of a group of system actors, all with the ability of Information
Mediator. Each Presentation Composer, instead, relies on the User Modeler to know the
user profile and correctly build a user-tailored presentation.

Therefore, such an architecture has to adopt group communication in order to support
an ‘intelligent’ pervasive computing model among users’ assistant devices and the
system actor information/service providers. To cope with these new challenges, we can
imagine that the system agents exploit a form of ‘implicit communication’, where they
can autonomously build up SOOs in order to satisfy a request as well as they can
at that time. This is not possible with traditional approaches that adopt simple
request/response-based communication styles (e.g., [16]). In fact, as shown in Figure 1,
classical matchmaker and broker approaches assume an advertise service dependency
(e.g., based on a preliminary registration phase), which forces the system actors to rely
on a centralized computing model.

4 Agent Patterns-Based Detailed Design 

The discussion above highlights the limits of traditional patterns when applied to our 
ambient intelligence pervasive computing scenario; hence, the necessity of characteriz- 
ing our system architecture by means of new agent patterns. 

4.1 The Service Oriented Organization 

In distributed computing, and especially in ‘intelligent’ pervasive computing scenarios,
each time an information consumer explicitly or implicitly causes a specific
service request, it inherently needs a searching capability in order to locate the service
provider. In particular, in our scenario, when the User Assistant is looking for a
Presentation Composer in order to ask for a personalized presentation, a matchmaker
(e.g., the one presented in Section 2) or a facilitator architecture is required
[10, 11, 13, 16].

Fig. 3. Actor Diagram for the Service Oriented Organization pattern.

As previously discussed, the matchmaker pattern illustrated in Section 2 does not
completely fit the requirements of our pervasive computing scenario (Example 2). Here,
we define a new agent pattern - the Service Oriented Organization pattern - illustrated
in Figure 3, which extends and adapts the matchmaker pattern of Figure 1.a. Here, the
actor Matchmaker is replaced by Organization Matchmaker, which is further decomposed
into two component system actors: Service oriented Organization and Initiator. The
dependencies between Consumer and Organization Matchmaker (or, more specifically,
Initiator) and between Consumer and Provider(s) are as before. The main difference,
instead, is that now there is no advertise service goal dependency between Organization
Matchmaker and Provider(s). In fact, our scenario calls for dynamic group
reconfiguration, which cannot be provided on the basis of a pre-declared and centrally
recorded set of service capabilities, as foreseen in the classical matchmaker approach.
The solution we propose, instead, is based on a proactive and, especially, dynamic
capability of service proposal, on the basis of the actual, current requests or needs for
services. In particular, our system’s low-level communication infrastructure is based on
group communication, which has been designed to support channelled multicast [3], that
is, a form of group communication that allows messages addressed to a single agent or a
group of agents (Provider(s)) to be received by everybody tuned on the channel (i.e., the
agent “introspection” capability described in Section 5). Thus, Provider(s) now depend on
Organization Matchmaker or, more specifically, on Providers Organizer to have a call for
service. That is, because each SOO member adopts an IP channelled multicast approach
that allows it to overhear on channels (see Section 5 for details), the organizer simply
sends its service request message on a specific channel and waits for offers from some
providers (see footnote 4). On the basis of such calls, Provider(s) may notify their current
service availability. Thus, the Providers Organizer depends on Provider(s) for propose
service and, vice versa, Provider(s) depend on the Providers Organizer for the final
agreement on service provision (goal agree service). Moreover, in an ‘intelligent’
pervasive computing scenario, the system awareness allows services to be proposed
proactively to the consumer without any explicit service request. Thus, Initiator acts as
an interface towards Consumer. It is able to interpret Consumer’s requests and,
especially, to proactively propose services that were not explicitly requested, on the basis
of Consumer’s profile and previous interaction history (see footnote 5). To this end,
Initiator depends on Providers Organizer to get new acquaintances about Provider(s) and
their services, while Providers Organizer depends on the Initiator to formulate request.
In this way, we can drop the dependency Provide service description between Matchmaker
and Consumer, which is instead present in the traditional matchmaker pattern. Finally,
Initiator requires that Provider(s) timely notify service status in order to propose only
active services.

4 In fact, channels are classified by topics and each provider is free to overhear on the preferred
channels according to its main interest and capabilities.
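
As an illustration only, the call for service, propose service and agree service exchange described above can be sketched as follows. This is a minimal sketch under stated assumptions: the channel is a synchronous in-memory stand-in rather than IP channelled multicast, the class and method names (Channel, Provider, ProvidersOrganizer, overhear, call_for_service) are hypothetical, and the numeric quality score used to pick a proposal is an assumption, since the pattern does not say how the final agreement is reached.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proposal:
    provider: str
    service: str
    quality: float  # hypothetical ranking score, not defined by the pattern


@dataclass
class Provider:
    name: str
    service: str
    quality: float
    active: bool = True

    def overhear(self, request: str) -> Optional[Proposal]:
        """propose service: answer a call only if able and currently active."""
        if self.active and request == self.service:
            return Proposal(self.name, self.service, self.quality)
        return None


class Channel:
    """In-memory stand-in for one multicast channel, identified by topic."""

    def __init__(self, topic: str) -> None:
        self.topic = topic
        self.tuned: List[Provider] = []

    def tune(self, provider: Provider) -> None:
        # No advertise-service step: a provider only starts listening.
        self.tuned.append(provider)

    def multicast(self, request: str) -> List[Proposal]:
        # Every tuned provider hears the call; only some of them reply.
        replies = (p.overhear(request) for p in self.tuned)
        return [r for r in replies if r is not None]


class ProvidersOrganizer:
    def call_for_service(self, channel: Channel, request: str) -> Optional[Proposal]:
        """Send the call, collect proposals, and agree on one of them."""
        proposals = channel.multicast(request)
        if not proposals:
            return None
        return max(proposals, key=lambda p: p.quality)  # agree service


channel = Channel("presentations")
video = Provider("video-composer", "presentation", quality=0.9)
text = Provider("text-composer", "presentation", quality=0.6)
channel.tune(video)
channel.tune(text)

organizer = ProvidersOrganizer()
agreed = organizer.call_for_service(channel, "presentation")
```

Note that the organizer never consults a registry of declared capabilities: a provider that becomes inactive simply stops replying, which is one way to read the requirement that only active services be proposed.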

4.2 The Implicit Organization Broker 

As observed in Section 1, ambient intelligence environments are often characterized by
intermittent communication channels. This problem is even more relevant when proactive
broadcasting is adopted, as in the scenario suggested by Example 2. In this case,
communications to/from the User Assistant need to be reduced to a minimum. To this
end, we propose here to exploit the implicit communication paradigm through the
adoption of an implicit organization broker (IOB) agent pattern that is inspired by the
implicit organization introduced by [4]. That is, we define the IOB as a SOO formed by
all the agents tuned on the same channel to play the same role (i.e., having the same
communication API) and willing to coordinate their actions. The term ‘implicit’ highlights
the fact that there is no group formation phase - since joining an organization is just
a matter of tuning on a channel - and no name for it - since the role and the channel
uniquely identify the organization. Its members play the same role but they may do it
in different ways; redundancy (as in fault-tolerant and load-balanced systems) is just a
particular case where agents happen to be perfectly interchangeable. In particular, we
can consider implicit organizations that play a kind of broker role. In other terms,
each time the system perceives the visitor’s information needs, the system actors set up
a SOO (as described in Section 4.1), which, in addition to the already presented
matchmaking capabilities, can also manage the whole service I/O process; that is, the SOO
is able to autonomously and proactively cope with the whole life cycle of the service
provision process. Such a system ability enhances the ambient intelligence awareness, a
system requirement that cannot be captured by adopting traditional agent patterns [10, 11].
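
The idea that an implicit organization has no formation phase and no name of its own can be illustrated with a small sketch. It is only a simulation under stated assumptions: the SimulatedNetwork class and the agent, role and channel names are hypothetical, and the dictionary merely stands in for the network so that the example can run. In the pattern itself there is no such central record, since membership follows from who is tuned on a channel.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Set, Tuple

# (role, channel): the only identity an implicit organization has.
OrgKey = Tuple[str, str]


class SimulatedNetwork:
    """Stand-in for the network; records which agents are tuned where."""

    def __init__(self) -> None:
        self._tuned: Dict[OrgKey, Set[str]] = defaultdict(set)

    def tune(self, agent: str, role: str, channel: str) -> None:
        # Joining an organization is just tuning on a channel with a role.
        self._tuned[(role, channel)].add(agent)

    def untune(self, agent: str, role: str, channel: str) -> None:
        # Leaving requires no group protocol either.
        self._tuned[(role, channel)].discard(agent)

    def organization(self, role: str, channel: str) -> FrozenSet[str]:
        # Membership is derived, never declared.
        return frozenset(self._tuned.get((role, channel), ()))


net = SimulatedNetwork()
net.tune("composer-a", "PresentationComposer", "presentations")
net.tune("composer-b", "PresentationComposer", "presentations")
net.tune("mediator-a", "InformationMediator", "presentations")

members = net.organization("PresentationComposer", "presentations")
```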

Figure 4 introduces an IOB pattern as a refinement/adaptation of the SOO pattern
introduced in Section 4.1. Provider(s) are now part of the organization itself, which plays
the role of an Organization Broker. Thus, the latter includes both Providers Organizer
and Provider(s) (see the inside of the dashed-line rectangle). It is worth noticing that the
IOB members are characterized by the same (required) skill (see Section 5).

The differences between the two traditional agent patterns of Figure 1 are naturally
reflected also between the two patterns illustrated in Figures 3 and 4. In particular,
Figure 3 tries to capture intentional aspects for more general group communication
scenarios, i.e., general SOOs. On the contrary, Figure 4 gives a level of pattern-based
detailed design focusing more on a special kind of SOO, tailor-made for ambient
intelligence scenarios. In other words, Figure 3 does not consider a strictly ‘intelligent’
pervasive computing scenario that, on the contrary, characterizes our IOB of Figure 4.

Fig. 4. Actor Diagram for the Implicit Organization Broker (IOB) pattern.

5 For example, every system actor, through environmental sensors, can perceive and profile users
during their visits across museum media services, as in the scenario of Example 2.

It is also worth noticing that the IOB pattern incorporates in the Initiator role
both the roles of Consumer and Initiator of the SOO pattern. As already said, this is a
consequence of the fact that, in ambient intelligence, some system actors concurrently
play the consumer and initiator roles, which allows the system to enhance autonomy
and proactivity skills. Moreover, and similarly to what happens in Figure 1.b between the
Consumer and the Broker, in Figure 4 the Initiator depends on the Organization Broker
- or, more specifically, on the Providers Organizer - to forward requested service, in
order to avoid overloading the User Assistant with messages and interactions.
Nevertheless, the IOB pattern allows the Initiator to increase its acquaintances, so as to
permit more precise service requests during future interactions, as already foreseen for
the generic Service Oriented Organization pattern.

5 Supporting Implicit Organization Brokers 

The two agent patterns presented so far, Service Oriented Organization and Implicit
Organization Broker, have been applied experimentally within the Peach project to build
an interactive, pervasive museum guide. As mentioned, our patterns require a group
communication infrastructure. To this end, we adopt the LoudVoice [4] experimental
communication infrastructure, which is based on channelled multicast and was developed
at our institute. Specifically, LoudVoice uses the fast but inherently unreliable IP
multicast - which is not a major limitation in our domain, since the communication media
in use are unreliable by their own nature. However, we had to deal with message losses
and temporary network partitions by carefully crafting protocols and using time-based
mechanisms to ensure consistency of mutual beliefs within organizations.

Fig. 5. Goal Diagram for an agent’s role characterization by means of its capabilities.



5.1 Agent Roles Characterization 

Analyzing an agent role means identifying and characterizing the main capabilities (e.g.,
internal and external services) the agent requires to fulfill the intentional dependencies
already identified by the agent patterns analysis of Section 4. Note that a capability (or
skill) is not necessarily justified by external requests (like a service), but it can be an
internal agent characteristic, required to enhance its autonomous and proactive behavior.
To deal with the rationale aspects of an agent at ‘design time’, that is, in order to look
inside and to understand how an agent exploits its capabilities, we adopt the Goal
Modeling Activity of Tropos [2]. In Figure 5, we adopt the means-end and AND/OR
decomposition reasoning techniques of the Goal Modeling Activity [2, 5, 18]. Means-end
analysis aims at identifying goals, tasks, and resources that provide means for achieving a
given goal. AND/OR decomposition analysis combines AND and OR decompositions of a
root goal into subgoals, modeling a finer goal structure. Notice that we have modeled
every agent capability as a goal to be achieved.

For the sake of brevity, here we consider only the IOB pattern. According to Figures 4
and 5, each time Initiator formulates a request, Providers Organizer achieves its
main goal cope with the request (i.e., the goal that Providers Organizer internally adopts
to satisfy Initiator’s request) relying on its three principal skills: define providers, deal
with fipa-acl performatives, and support organizational communication. The success of
the principal goal depends on the satisfaction of all three goals (i.e., AND decomposition).
For the sake of simplicity, Figure 5 does not consider Initiator and its intentional
relationships. An adequate organizational communication infrastructure is used to
enhance the autonomous and proactive behavior of the system actors by means of group
communication based on channelled multicast [3] (see goal provide channelled multicast),
which allows messages to be exchanged over open channels identified by topic of
conversation. Thus, a proper structuring of conversations among agents allows every
listener to capture its partners’ intentions without any explicit request, thanks to its agent
introspection skill (see goal allow agents introspection). Indeed, each actor is able to
overhear every ongoing interaction on specific channels; hence, it can choose the best role
to play in order to satisfy a goal, provide a resource, or perform a task, without any
external decision control, but only according to its internal beliefs and desires.
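
The AND decomposition of the goal cope with the request can be made concrete with a short sketch of how goal satisfaction propagates through AND/OR decompositions. The Goal class, its fields and the satisfied method are hypothetical names introduced only for illustration; they are not part of Tropos or of any Tropos tool, and the sketch covers only the three subgoals named in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Goal:
    name: str
    decomposition: str = "LEAF"          # "AND", "OR", or "LEAF"
    subgoals: List["Goal"] = field(default_factory=list)
    achieved: bool = False               # only meaningful for leaf goals

    def satisfied(self) -> bool:
        if self.decomposition == "AND":
            # An AND-decomposed goal needs every subgoal.
            return all(g.satisfied() for g in self.subgoals)
        if self.decomposition == "OR":
            # An OR-decomposed goal needs at least one subgoal.
            return any(g.satisfied() for g in self.subgoals)
        return self.achieved


define_providers = Goal("define providers")
fipa_acl = Goal("deal with fipa-acl performatives")
org_comm = Goal("support organizational communication")

cope = Goal(
    "cope with the request",
    decomposition="AND",
    subgoals=[define_providers, fipa_acl, org_comm],
)
```

With this representation, cope.satisfied() becomes true only once all three skills have been achieved, which mirrors the AND decomposition stated above.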

Exploiting the provide channelled multicast ability, each actor can decide by itself
what channels to listen to, by means of a subscription phase (represented by the tasks
discover channels and maintain a channel list). This communication mechanism supports
group communication well for service oriented and implicit organizations composed
of members with the same interests or skills. Such organizations assist the User
Assistant by sparing it the need to know how to interact directly with the museum
multimedia services.

5.2 Group Communication: Dynamics 

As described earlier, the museum visitor guide system is composed of several different
types of agents. Rather than by individual agents, most components are formed by a
group of coordinated agents, as presented by Example 2 of Section 3. Modeling this
example requires the representation of implicit organizations, which cannot be done
with a regular AUML communication diagram, as presented, e.g., in [2, 5]. Therefore,
in Figure 6, we propose a new type of diagram that deals with the group communication
features required by the scenario introduced with Example 2. Here, the shaded
rectangles and the dashed lines below them represent the implicit organizations, and
the gray rectangles represent the communication internal to implicit organizations.
Requests sent to an organization are presented as arrows terminating in a dot at the
border of the organization; the organization reply is presented by an arrow starting from
a dot on the organization border. We consider an asynchronous message-based
communication model.

In the example diagram of Figure 6, the request for presentation is initiated by a
certain User Assistant on behalf of a specific user. The request is addressed to the Implicit
Organization of Presentation Composers. Presentation composers have different
capabilities and require different resources. Hence, every presentation composer requests
user information on the user model and presentation data (availability, constraints, etc.)
from the Implicit Organization of Information Mediators. In turn, the implicit
organization of information mediators holds an internal conversation. Each member
suggests the service it can provide. The “best” service is selected and returned, as a group
decision, to the requesting presentation composer. At this stage, the presentation
composers request additional information from the Implicit Organization of User
Assistants, regarding the availability of assistants capable of showing the presentation
being planned. When all the information has been received, the implicit organization of
presentation composers can reason and decide on the best presentation to prepare. This
will be sent from the composers as a group response to the selected (user) assistant.

6 Conclusions 

Ambient intelligence scenarios characterized by service oriented organizations, where
groups of agents collaborate at different levels and times during the service provision
life cycle, generate new software architectural requirements that traditional agent
patterns cannot satisfy. For example, ‘intelligent’ pervasive computing and peer-to-peer
computing models naturally support group communication for ambient intelligence,
but they also call for architectural flexibility in terms of dynamic group reconfiguration.
Traditional request/response communication protocols are not appropriate to cope
with service negotiation and aggregation that must be ‘dynamically validated’, since the
environment conditions and the user location change quickly.

For such reasons, we propose two new agent patterns (Service Oriented Organization
and Implicit Organization Broker) and compare them with traditional patterns [10,
11]. Specifically, we adopt the agent-oriented software development methodology
Tropos [2, 5] to effectively identify the new requirements. For example, using Tropos,
we can keep the agent conversation and social levels independent from complex
coordination activities, thanks to an inherently pure peer-to-peer computing model [13].
This way of modeling is intended to enrich the detailed design phase of the Tropos
methodology with new capabilities, more oriented towards sophisticated software
agents, which require advanced modeling mechanisms to better fit group communication,
goals, and negotiations. Thus, we have been able to capture important aspects of
ambient intelligence requirements and to build new agent patterns that are more flexible
and reusable than the traditional ones.

Abstract. The post Human Genome Project era calls for reliable, integrated,
flexible, and convenient data management techniques to facilitate research ac- 
tivities. Querying biological data that is large in volume and complex in struc- 
ture such as 3D proteins requires expressive models to explicitly support and 
capture the semantics of the complex data. Protein 3D structure search and 
comparison not only enable us to predict unknown structures, but can also re- 
veal distant evolutionary relationships that are otherwise undetectable, and per- 
haps suggest unsuspected functional properties. In this work, we model 3D pro- 
tein structures by adding spatial semantics and constructs to represent the 
contributing forces such as hydrogen bonds and high-level structures such as 
protein secondary structures. This paper makes a contribution to modeling the 
specialty of life science data and develops methods to meet the novel chal- 
lenges posed by such data. 



1 Introduction 

The Human Genome Project and its concomitant research have provided the scientific 
community with data that is increasing in volume and complexity. It has generated a 
precious information pool that can be used to support the best interests of human beings. 
To exploit this information, we need new ways to manage, integrate, and present the 
data so complex questions can be answered effectively. To do so, we need expressive 
data models that can capture the semantics of the wide variety of biological data. 

Proteins are large biological molecules with complex structure and they constitute 
much of the bulk of living organisms. In order to understand the life processes of an 
organism, it is necessary to first know the functions of the proteins. Since the function 
of a protein in a given environment is determined by its structure, we need to know 
the structure of the molecule to fully understand its function. The success of the Hu- 
man Genome Project generated multiple protein databases including protein sequence 
databases and protein 3D structure databases [8]. Each 3D structure stored in the 
databases is either determined by experimental methods such as X-ray crystallogra- 
phy and Nuclear Magnetic Resonance or by computational chemistry [22]. Research- 
ers need to search these databases for specific structures or compare structures with 
each other to seek similarities. Similar sequences can result in similar 3D structures, 
and similar structures perform similar functions. Therefore, protein structure
similarities may be predicted based on sequence similarities. More importantly, searching
for similar protein structures can help us find homologs that sequence searches cannot
discover, and homologs often conserve structure more strongly than sequence. Also,
we can explore protein evolution because similar protein folds can be used to support
different functions [13]. Meanwhile, if we can identify conserved core elements of
protein folding, the information can be used to model related proteins of unknown
structures. Being able to determine, search and compare 3D protein structures, rather
than just comparing sequences, is thus becoming very important in the life sciences.
However, structure comparisons are currently a big challenge. We address this issue 
in our research by developing a semantic model. We believe our semantic model can 
facilitate the development of techniques and operators for 3D structure searching and 
comparison. In this paper, we extend our previous work on semantic modeling of 
DNA sequences and primary protein sequences to three-dimensional (3D) protein
structures with novel semantics, using extended Entity-Relationship Modeling
[10]. When studying proteins, scientists investigate not only the amino acid subunits
that form a protein and their order, but also how the sequences fold into 3D structures 
in certain ways due to chemical forces. Currently, protein structure data is stored in 
plain text files that record the three-dimensional coordinates of each non-hydrogen 
atom as well as a small part of the substructures. The text file formatted data doesn’t 
capture biological meaning. Comparison and search over structure data have to be 
done using visualization tools and extra software tools running on various algorithms. 

We define the semantics of primary, secondary, tertiary and quaternary structures 
of proteins by describing their components, chemical bonding forces, and spatial 
arrangement. To model the 3D structure of a protein and its formation, we need to 
explicitly represent the spatial arrangement of each component in addition to its se- 
quential order, along with its associated biological information. Our semantic model 
captures the semantics of protein data and specifies it using an annotation-based ap- 
proach to capture all of this semantic information in a straightforward way. 

The rest of the paper is organized as follows. Section 2 provides a brief back- 
ground about proteins and their various levels of structures, and describes the seman- 
tics of such structures. In section 3 we describe entity classes and new constructs to 
represent the semantics of protein structures and develop annotations to capture their 
spatial arrangement and biological characteristics. Also, we briefly review related 
research and justify why it is necessary to develop new semantic constructs to model 
protein structures. In section 4, we describe the utility of our semantic model, demon- 
strate its application and point out extensibility of the model for other similar fields. 
We conclude with a discussion of future research directions in section 5. 

2 Background 

2.1 Protein Structures 

Proteins are the most important macromolecules in the factory of living cells that 
perform various biological tasks. Basically, a protein is composed of various numbers 
of 20 kinds of amino acids, also known as subunits or residues (see Figure 1). These 
residues are arranged in a specific order or sequence; each amino acid is denoted by a 
letter of the English alphabet [6].

Multiple amino acids bond together through condensation reaction to form amino 
bonds (see Figure 2) which connect subunits into a protein sequence. One protein
sequence can range from 10 to 1000 amino acids (residues). The actual proteins are
not linear sequences; rather they are 3-D structures. Knowing this 3-D structure is a 
key to understanding the protein’s functions and for using it to improve human lives. 




Fig. 1. General Structure of Amino Acids. Different side chains R (R = H, CH3, ...)
determine the type of amino acid. Each amino acid contains an amino group and a
carboxylate group. A hydrogen atom on the amino group reacts with the hydroxyl on the
carboxylate group through condensation to form amino bonds.

Fig. 2. Structure of Amino Bonds. Two amino acid residues (subunits) are shown.



The general principle that protein sequences follow to fold into 3-D structures is 
that “The three-dimensional structure of a native protein in its normal physiological 
milieu is the one in which the Gibbs free energy of the whole system is lowest; that 
is, the native conformation is determined by the totality of inter-atomic interactions 
and hence by the amino acid sequence, in a given environment” [11]. Following are 
descriptions for the four levels of protein structures [6]. 

Primary Structure. The primary structure of a protein refers to the exact sequence of 
amino acids in a protein. Hence the primary structure is linear in nature; it says noth- 
ing about the spatial arrangement of amino acids or their atoms. It merely shows the 
specific amino acids used to compose the protein and their linear order. 

M-H-G-A-Y-R-T-P-R-S-K-T-D-A-Y-G-C 

Fig. 3. Primary Protein Structure 
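
To make the purely linear nature of the primary structure concrete, the following minimal sketch turns the hyphen-separated string of Fig. 3 into an ordered list of residues. It assumes the standard one-letter codes for the 20 amino acids; the function name parse_primary_structure and the tuple layout are hypothetical and serve only as an illustration.

```python
from typing import List, Tuple

# Standard one-letter codes for the 20 amino acids.
ONE_LETTER = {
    "A": "Alanine", "R": "Arginine", "N": "Asparagine", "D": "Aspartic acid",
    "C": "Cysteine", "Q": "Glutamine", "E": "Glutamic acid", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "L": "Leucine", "K": "Lysine",
    "M": "Methionine", "F": "Phenylalanine", "P": "Proline", "S": "Serine",
    "T": "Threonine", "W": "Tryptophan", "Y": "Tyrosine", "V": "Valine",
}


def parse_primary_structure(text: str) -> List[Tuple[int, str, str]]:
    """Return (position, letter, name) for each residue, in sequence order."""
    residues = []
    for position, letter in enumerate(text.strip().split("-"), start=1):
        if letter not in ONE_LETTER:
            raise ValueError(f"unknown residue code {letter!r} at position {position}")
        residues.append((position, letter, ONE_LETTER[letter]))
    return residues


residues = parse_primary_structure("M-H-G-A-Y-R-T-P-R-S-K-T-D-A-Y-G-C")
```

The result carries order and residue identity only; nothing in it describes where a residue sits in space, which is exactly the limitation of the primary structure noted above.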



Secondary Structure. The covalently linked amino acids are further organized by
forming regularly repeating patterns such as the α helix, the β sheet, and other less
common structures [16]. Hydrogen bonds are the cause of the secondary structure. More
specifically, a hydrogen bond is the spatial interaction between a hydrogen atom in an
N-H group and a nearby highly electronegative carbonyl oxygen, as shown in Figure 4.
Each atom participating in the formation of a hydrogen bond is from a different residue;
the distance between these residues determines the possible category of the secondary
structure. The secondary structure is the base on which tertiary and quaternary
structures are formed. 3D structure search and comparison start from the secondary
structure level [19]. One protein sequence or chain can contain multiple different
secondary structures. Each of these secondary structures is formed on a segment of the
primary sequence. Several adjacent secondary structures form a motif and motifs can
group to form domains.

Fig. 4. Formation of Hydrogen

Tertiary Structure. Several motifs typically combine to form a compact globular 
structure referred to as a domain. The tertiary structure is used to describe the way 
motifs are arranged into domain structures and the way a single polypeptide chain 
folds into one or several domains. The side chain interactions contributing to the 
formation of domains or tertiary structures include: (a) Hydrogen bonds between 
polar side chain groups; (b) Hydrophobic interactions among non-polar groups; (c) 
Salt bridges between acidic and basic side chains; and, (d) Disulfide bridges. Here, 
each “chain” is a protein sequence recorded in protein sequence databases. 

Quaternary Structure. For proteins with more than one chain, interactions can oc- 
cur between the chains themselves. The forces contributing to the quaternary struc- 
ture are of the same kinds as those for tertiary structures, but they are between chains, 
not within chains. 

To summarize, various inter- and intra-molecular forces work together to decide 
the least energy/most stable structure of the proteins. The structure determines how 
various biological tasks are performed. It is therefore important to represent the se- 
mantics of these structures so they can be queried easily. 

2.2 Current Protein Structure Databases and Usage 

The Protein Data Bank, PDB (http://www.rcsb.org/pdb/) is the only worldwide ar- 
chive of experimentally determined (using X-ray Crystallography and Nuclear Mag- 
netic Resonance techniques) three-dimensional structures of proteins [2]. It is oper- 
ated by the Research Collaboratory for Structural Bioinformatics (RCSB). There are 
26485 structures stored in PDB as of now with the number and complexity of the 
structures increasing rapidly. 

The format of data stored in PDB consists of a HEADER followed by the data. 
There are two major categories of data. The first category includes the identifier of 
each protein, the experiment (which determined its structure), the authors, keywords 
and references etc. The other more important category is the x, y and z coordinates of 
each non-hydrogen atom (heavy atom) in the structure. This format records the pro- 
tein sequence and the composition of its secondary structure. It does not record the 
tertiary and quaternary structures, which as stated earlier, are very important parts of 
the protein structure. The core of the protein data lies in its coordinate data or its 
spatial arrangement. However, spatial coordinates by themselves depict nothing more 
than the shape and provide no biological value, unless they can be related to other 
information such as the shape, the strength (or energy), and the length of the chemical 
bonds between the various subunits of a protein. This is very important in structural 
genomics [12] and is used to connect spatial data with whole-genome data and to 
relate it to various biological activities. 
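
As a small illustration of relating raw coordinates to a chemically meaningful quantity, the sketch below reads the x, y and z values of two atoms and computes the distance between them. It assumes the legacy fixed-column ATOM record layout of PDB flat files; the helper names (format_atom_record, parse_atom_record, distance) are hypothetical, and the two sample atoms carry made-up coordinates rather than data from a real entry.

```python
import math
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def format_atom_record(serial, name, residue, chain, residue_number, x, y, z) -> str:
    """Build a fixed-column ATOM line (assumed layout, for the example only)."""
    return (f"ATOM  {serial:5d} {name:<4} {residue:>3} {chain}{residue_number:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom_record(line: str) -> Atom:
    """Slice the fixed columns of an ATOM line into typed fields."""
    if not line.startswith("ATOM"):
        raise ValueError("not an ATOM record")
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        residue=line[17:20].strip(),
        chain=line[21],
        residue_number=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the units of the coordinates."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


# Made-up coordinates, chosen so the distance is easy to check by hand.
n_atom = parse_atom_record(format_atom_record(1, "N", "MET", "A", 1, 0.0, 0.0, 0.0))
ca_atom = parse_atom_record(format_atom_record(2, "CA", "MET", "A", 1, 1.0, 2.0, 2.0))
separation = distance(n_atom, ca_atom)
```

Even this small step needs code outside the file itself, which supports the point above that coordinates alone carry no biological meaning until they are related to bonds and other structural information.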

Researchers use experimental methods or computational calculation to determine
or predict the structure of proteins, and submit their data to PDB [5, 23]. PDB
processes the data based on the mmCIF (Macromolecular Crystallographic Information
File standard) dictionary, an ontology of 1700 terms that define the macromolecular
structure and the crystallographic experiment [4]. The data is then stored in
the core database.

Structure data stored in flat files poses challenges to effective and efficient data re- 
trieval and analysis. Because data in flat file formats is not machine-processable, 
specialized structure search and comparison software tools are required to access and 
analyze the data. Each of the tools available has its own interface supporting slightly 
different invocation, processing and data semantics. This makes using the tools diffi- 
cult because the input must be guaranteed to be in the correct format with the correct 
semantics and the tools must be invoked in specific, non-standard ways [7]. Primary 
protein structure or protein sequence search and comparison software tools use the 
BLAST (Basic Local Alignment Search Tool) algorithm [1], while 3D structure 
search and comparison software tools use more or less similar algorithms [15, 20] 
based on VAST (Vector Alignment Search Tool) [18]. Meanwhile, co-existence of 
multiple search tools makes one-stop search impossible. Inefficient structure search 
and comparison has become the bottleneck for high throughput experiments. Our 
research focuses on how to represent 3D structure semantics in a conceptual model 
which can be used for seamless interoperation of multiple resources [21]. More
importantly, operators [9] can be developed based on the semantic model to facilitate
query and analysis of structure data. Therefore, using extra software tools would 
become unnecessary and the bottlenecks can be eliminated. 

In this paper, we propose new constructs to explicitly represent the semantics of 
protein structures. We believe our model will help facilitate the ability of scientists to 
understand and detect the link between protein structures and their biological func- 
tions. Our proposed model provides formally defined constructs to represent the se- 
mantics of protein structure data. The spatial elements are represented using annota- 
tions. Our semantic model will also aid the development of tools to process ad hoc 
queries about protein structures. It can also be used as a canonical model to virtually 
unify protein structure data [3, 23]. 

3 Proposed Semantic Model 

3.1 Entity Classes 

Atoms. This entity class is used to model chemical atoms (C, H, O, N and heteroatoms
such as S) in the protein structure, each identified uniquely by an ID. Each atom in
the structure can be represented by a 3-tuple A <as, an, ty>. The as element is the
atom's serial number, with as ∈ AS, where AS is the collection of all atom serial
numbers. The an element is the atom name, with an ∈ AN, where AN = {C, H, O, N, S}.
The last element ty is the atom type, with ty ∈ TY, where TY = {Cα, N, Cc, Oc, Hs, Hn,
Hcα, S}. Each element in TY is a representative category of atoms in an amino acid:
Cα is the α carbon, N is the nitrogen atom, Cc is the carboxylate carbon, Oc is the
carboxylate oxygen, and Hs, Hn and Hcα are the side chain hydrogen, amino hydrogen and
Cα hydrogen respectively. S is the sulfur atom that contributes to structure formation
such as disulfur bridges.

Residues. Each residue (amino acid subunit) is an aggregate of a set of component
atoms and a set of intra-residue bonds. Each residue can be represented using a
4-tuple R <rs, rn, {a_i | a_i ∈ A}, {B(a_i, a_j), BL, BE | a_i, a_j ∈ A}>. Element rs is
the residue serial number, with rs ∈ N ∧ rs ≤ SL, where SL is the length of the protein
sequence, i.e., the total number of residues. The second element rn is the residue
name, with rn ∈ AA, where AA is the set of all types of amino acid residues: AA = {A,
R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V}. {a_i | a_i ∈ A} is the set of
atoms in the residue, identified by their atom serial numbers. {B(a_i, a_j), BL, BE |
a_i, a_j ∈ A} is the set of internal bonds in the residue, where B(a_i, a_j) is the bond
between atom i and atom j, and BL and BE record the bond length and bond energy. This
structure data is determined by experimental or computational methods.
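
The Atoms and Residues tuples above can be read as simple record types. The following is a minimal Python sketch of that reading, not the authors' implementation: the class names, the ASCII spellings of the atom types ("Ca" for Cα, "Hca" for Hcα) and the numeric bond values are illustrative assumptions.

```python
from dataclasses import dataclass

# Value sets taken from the definitions of AN, TY and AA in the text;
# the ASCII spellings "Ca" and "Hca" stand in for the Greek-letter names.
ATOM_NAMES = {"C", "H", "O", "N", "S"}
ATOM_TYPES = {"Ca", "N", "Cc", "Oc", "Hs", "Hn", "Hca", "S"}
AMINO_ACIDS = set("ARNDCEQGHILKMFPSTWYV")


@dataclass(frozen=True)
class Atom:
    """3-tuple A <as, an, ty>."""
    serial: int      # as: atom serial number
    name: str        # an: atom name, must be in AN
    atom_type: str   # ty: atom type, must be in TY

    def __post_init__(self):
        if self.name not in ATOM_NAMES:
            raise ValueError(f"unknown atom name: {self.name}")
        if self.atom_type not in ATOM_TYPES:
            raise ValueError(f"unknown atom type: {self.atom_type}")


@dataclass(frozen=True)
class Bond:
    """B(a_i, a_j) with bond length BL and bond energy BE."""
    a_i: int         # serial number of the first atom
    a_j: int         # serial number of the second atom
    length: float    # BL
    energy: float    # BE


@dataclass
class Residue:
    """4-tuple R <rs, rn, atoms, bonds>."""
    serial: int      # rs: residue serial number
    name: str        # rn: one-letter residue name, must be in AA
    atoms: set       # serial numbers of the component atoms
    bonds: list      # intra-residue bonds

    def __post_init__(self):
        if self.name not in AMINO_ACIDS:
            raise ValueError(f"unknown residue name: {self.name}")
        for b in self.bonds:
            # An intra-residue bond must connect two atoms of this residue.
            if b.a_i not in self.atoms or b.a_j not in self.atoms:
                raise ValueError("bond refers to an atom outside the residue")


# Illustrative values only, not measured data.
glycine = Residue(1, "G", {1, 2}, [Bond(1, 2, 1.46, 0.0)])
```

Validating at construction time mirrors the membership constraints (an ∈ AN, ty ∈ TY, rn ∈ AA) stated in the definitions.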

Primary Structure. The primary structure of a protein is the sequence of amino
acids comprising the protein. It describes which amino acid subunits are used, and
in which order, to form a specific protein sequence. It can be represented as a
3-tuple PS <psid, pn, psl>, where psid is the unique identifier for each protein
sequence record in the database, pn is the biological name of the protein and psl is
the length of the protein sequence. Here psid ∈ PSID, where PSID is the collection of
all available identifiers for protein sequences, and pn ∈ PN, where PN is the
collection of the names of all proteins.

Segment. A complete protein sequence can be fragmented into several segments, and
each segment is folded to form certain higher-level structures. A segment is defined
as seg <segid>, where segid ∈ N and seg ∈ SEG. SEG is the set of all segments
fragmented from a single protein sequence. More information about each segment is
represented by the Fragment relationship between Primary Structure and Segment.

Forces. This entity class represents the four chemical forces that contribute to the 
formation of secondary structures, namely, hydrogen bonds, disulfur bridges, salt 
bridges, and hydrophobic interactions. FORCES is a superclass with four subclasses 
where each of them represents one of the four types of chemical forces. Hydrogen
bonds are the focus of our study, but by capturing the other three types of forces in
our model, we allow for future expansion.

Hydrogenbonds. A hydrogen bond is modeled as an entity class because it is the
main cause of secondary structures, based on which protein sequences fold into
specific spatial arrangements. HYDROGENBONDS as a superclass has two subclasses,
BACKBONE and SIDECHAIN. BACKBONE represents hydrogen bonds formed by backbone
hydrogen atoms. SIDECHAIN represents hydrogen bonds formed by non-backbone hydrogen
atoms. We focus on backbone hydrogen bonds in this paper; sidechain hydrogen bonds
are included in the model for future investigation. Each hydrogen bond in the
structure can be depicted as a 3-tuple hb <(Hn_i, Oc_j), BL, BE>. The first element
records the hydrogen bond formed between a hydrogen atom from an amino group and an
oxygen atom from a carboxylate group. Here i is the serial number of the residue
which donates the hydrogen atom and j is the serial number of the residue which
donates the oxygen atom, where |i - j| ≥ 3 because the two residues have to be at
least 3 units away from each other to form a hydrogen bond. Different distances
between amino acids cause different forms of secondary structures [16]. For each
protein structure, there is a set of hydrogen bonds formed within the structure, HB,
where HB = {hb_m | hb_m = <(Hn_i, Oc_j), BL, BE> ∧ i, j, m ∈ N ∧ i, j ≤ SL ∧ |i - j| ≥ 3}.
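
The constraints on the set HB can be checked mechanically. The sketch below is one possible reading, assuming that residue serial numbers start at 1 and that the separation constraint means "at least 3 apart", as the prose states; the names HBond, is_valid and by_separation, and all numeric values, are hypothetical.

```python
from typing import NamedTuple


class HBond(NamedTuple):
    """3-tuple hb <(Hn_i, Oc_j), BL, BE>."""
    i: int      # serial number of the residue donating the amino hydrogen
    j: int      # serial number of the residue donating the carboxylate oxygen
    bl: float   # bond length
    be: float   # bond energy


def is_valid(hb: HBond, sl: int, min_sep: int = 3) -> bool:
    """Check i, j <= SL and that the two residues are at least min_sep apart."""
    in_range = 1 <= hb.i <= sl and 1 <= hb.j <= sl
    return in_range and abs(hb.i - hb.j) >= min_sep


def by_separation(bonds, sl: int) -> dict:
    """Group the valid hydrogen bonds by the residue separation |i - j|."""
    groups = {}
    for hb in bonds:
        if is_valid(hb, sl):
            groups.setdefault(abs(hb.i - hb.j), []).append(hb)
    return groups


# Illustrative values only, not measured data.
bonds = [
    HBond(5, 1, 2.9, -1.5),    # separation 4: kept
    HBond(2, 1, 3.0, -0.5),    # separation 1: too close, rejected
    HBond(9, 5, 2.8, -2.0),    # separation 4: kept
    HBond(12, 8, 2.9, -1.0),   # residue 12 exceeds SL = 10, rejected
]
groups = by_separation(bonds, sl=10)
```

Grouping by |i - j| is convenient because the separation between the two residues is what distinguishes the different forms of secondary structure.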

Fig. 5. Semantic model of protein 3D structure



Secondary Structure. A secondary structure is formed by hydrogen bonding. Multiple
hydrogen bonds make a segment of the protein sequence fold in specific ways.
Therefore, each secondary structure corresponds to a set of hydrogen bonds. The type
of hydrogen bonds in the set determines the type of the secondary structure, i.e., α
helix or β sheet. The contributing hydrogen bond set is defined as a set HB_n, where
HB_n ⊆ HB. If |i - j| = n and n ∈ {3, 4, 5}, then the resulting structure is called an
n-turn. Multiple adjacent turns form an α helix. Overlapping minimal helices can form
longer helices. Other types of structures can also be defined using i and j. The
geometry of a secondary structure can be represented as a vector, which is the basis
for VAST. Each vector is represented as a 4-tuple <VL(Å), VS(x,y,z), VE(x,y,z),
VMP(x,y,z)>. VL is the vector length in angstroms, VS(x,y,z) are the internal
coordinates of the starting point of the vector, VE(x,y,z) are the internal
coordinates of the ending point of the vector and VMP(x,y,z) are the internal
coordinates of the middle point of the vector. These values of a secondary structure
element (SSE) need to be defined because structure comparison is based on vector
geometry. Within each protein structure, all SSEs group together to form a set SSE,
where SSE = {sse_i | sse_i = <VL(Å), VS(x,y,z), VE(x,y,z), VMP(x,y,z)> ∧ i ∈ N}.
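
Of the four components of an SSE vector, the length and the middle point follow from the two end points. A minimal Python sketch of this derivation is given below; the names SSEVector and make_sse_vector are hypothetical, and the coordinates are illustrative.

```python
import math
from typing import NamedTuple


class SSEVector(NamedTuple):
    """4-tuple <VL, VS(x,y,z), VE(x,y,z), VMP(x,y,z)>."""
    vl: float     # vector length, in the same unit as the coordinates
    vs: tuple     # starting point
    ve: tuple     # ending point
    vmp: tuple    # middle point


def make_sse_vector(vs, ve) -> SSEVector:
    """Derive VL and VMP from the starting and ending points."""
    vl = math.dist(vs, ve)                                # Euclidean length
    vmp = tuple((s + e) / 2 for s, e in zip(vs, ve))      # component-wise mean
    return SSEVector(vl, tuple(vs), tuple(ve), vmp)


helix = make_sse_vector((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
```

Storing VL and VMP alongside the end points is redundant, but it matches the 4-tuple of the model and avoids recomputing them for every comparison.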



Fig. 6. Secondary Structure Entity Class, annotated //VL(Å)//VS(x,y,z)//VE(x,y,z)//VMP(x,y,z)

Fig. 7. Spatial-Aggregate Relationship, annotated //P(deg)/P(deg)/P(deg)//T(c)

Motif. Secondary structure elements usually arrange themselves in simple motifs.
Motifs are formed by packing side chains from adjacent secondary structures, such as
α helices and β sheets, that are close to each other. Therefore, a motif is a cluster
of SSEs, and we define it as M = {sse_j | sse_j = <VL(Å), VS(x,y,z), VE(x,y,z),
VMP(x,y,z)> ∧ j ∈ N} with M ⊂ SSE. The relative spatial relationship between pairs of
SSEs can be defined by six geometrical variables (see Figure 12).

Domain and Quaternary Structure. This entity class represents a compact globular
structure composed of several motifs. It is defined as D = {M_k | M_k ∈ M ∧ M ⊂ SSE
∧ k ∈ N}. When the protein of interest has more than one polypeptide chain, the
interaction among chains folds them further to form quaternary structures.

3.2 Relationships 

Spatial-Aggregate. This construct (see Figure 7) is defined to capture the spatial
arrangement of each atom in an amino acid to form the protein's 3-D structure. It
is similar to the normal aggregate in existing spatial semantic models, except that
in this case we want to represent the x, y and z coordinates of each atom. Each atom
can be depicted as a point. The annotation //P(deg)/P(deg)/P(deg) indicates that the
position of this point can be measured using x, y and z coordinates in degrees. //T(c)
is another dimension denoting the temperature (Celsius) at the time the structure
was determined, because temperature changes affect the activities of atoms, and
relative positions change depending on the temperature. The relationship can be more
concretely represented as spa <a_i, r_j, <x, y, z>> with spa ∈ SPA, where SPA is the
complete set of spatial aggregate relationships within each structure, SPA = {spa_l |
spa_l = <a_i, r_j, <x, y, z>> ∧ a_i ∈ A ∧ r_j ∈ R ∧ l ∈ N}. The constraint a_i ∈ A ∧
r_j ∈ R states which atom is a component of which residue at what position, where x,
y and z are the coordinates of each atom.

Sequential Aggregate. This is a concept borrowed from our previous work to model
DNA sequences and primary protein sequences. This construct represents the fact that
multiple amino acids/residues are bonded together in a specific linear order to form
a sequence. sea <r_i, ps_j, X> represents which residue is at which position of which
protein sequence, where sea ∈ SEA and SEA = {sea_m | sea_m = <r_i, ps_j, X> ∧ r_i ∈ R ∧
ps_j ∈ PSID ∧ X ≤ SL ∧ m ∈ N}. SL represents the sequence length and X is an integer
that represents the position of this residue in the sequence, where the position
number has to be less than or equal to the length of the sequence.




Fig. 8. Sequential-Aggregate Relationship

Fig. 9. Fragment Construct

Fragment. This relationship can be considered the exact opposite of the aggregation
relationship. The “O” in this relationship represents the fact that segments can
overlap, or one segment can contain another. The formal definition of this
relationship is frag <psid, segid, sp, ep> with frag ∈ FRAG, where FRAG = {frag_n |
frag_n = <psid, segid, sp, ep> ∧ psid ∈ PSID ∧ segid, n ∈ N ∧ sp, ep ≤ SL}. It states
which segment is fragmented from which protein sequence, starting and ending at what
points. The length of the segment can be easily derived by subtracting the starting
point from the ending point. A complete protein sequence contains several segments at
different levels, where each segment contributes to a higher-level structure. For
example, a segment of size 4 can form a 4-turn, and several adjacent 4-turns can
group together to form a helix. Helices can further group together to form motifs.
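
The overlap and containment semantics of the Fragment relationship reduce to interval arithmetic on the start and end points. The sketch below is a minimal illustration in Python; the names Frag, length, overlaps and contains are hypothetical, and the length follows the rule stated above (ending point minus starting point).

```python
from typing import NamedTuple


class Frag(NamedTuple):
    """4-tuple frag <psid, segid, sp, ep>."""
    psid: str    # protein sequence the segment is fragmented from
    segid: int   # segment identifier
    sp: int      # starting point
    ep: int      # ending point


def length(f: Frag) -> int:
    """Segment length, derived as ending point minus starting point."""
    return f.ep - f.sp


def overlaps(f: Frag, g: Frag) -> bool:
    """Two segments of the same sequence overlap if their ranges intersect."""
    return f.psid == g.psid and f.sp <= g.ep and g.sp <= f.ep


def contains(f: Frag, g: Frag) -> bool:
    """Segment f contains g if g's range lies entirely inside f's range."""
    return f.psid == g.psid and f.sp <= g.sp and g.ep <= f.ep


a = Frag("P1", 1, 1, 10)
b = Frag("P1", 2, 4, 8)
c = Frag("P1", 3, 12, 15)
```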

Spatial-Bonding. This relationship is used to describe how atoms in the structure
form forces that contribute to the formation of secondary structure. A1 and A2 are the
two atoms participating in the force. For hydrogen bonds, it specifies which two
residues contribute to the bond. This is specified as //HB(H_i, O_j)//BL(Å)//BE(kcal/mol).
BL records the bond length and BE is the bond energy of the chemical force. From the
values of i and j, the type of fold (i.e., α or β) can be determined.



Fig. 10. Spatial-Bonding Relationship, annotated //FORCES(A1, A2)//BL(Å)//BE(kcal/mol)

Fig. 11. Spatial Composition Relationship, annotated {SSEP(SSE_m, SSE_n) = (VL_m(Å), VL_n(Å), VD(m,n)(Å), BA_m(deg), BA_n(deg), DA(m,n)(deg)) | m ≠ n}

Spatial Composition. This is a very complicated relationship that captures all
geometrical data required to compare 3D structures based on the arrangement of
secondary structures. As mentioned earlier, each secondary structure can be
geometrically represented by a vector.

Consider the two vectors V_m and V_n in Figure 12 as an example. Each of them has its
own length, denoted by VL_m and VL_n. Since the mid-point of each vector is recorded,
the distance VD(m, n) between two vectors can be measured using the distance between
the mid-points of the two vectors. Together with the two bond angles BA_m and BA_n as
well as the dihedral angle between the two vectors DA(m, n), these variables can
strictly define how two vectors or secondary structure elements are arranged
spatially. If there is a pair of SSEs (SSEP) in another protein structure that
displays the same geometrical arrangement, structure similarities may be inferred.
With further analysis, the SSE pairs can be enlarged to accommodate more secondary
structure elements to infer higher-level structure similarities. We can think of
SSEPs as SSE groups of the finest granularity.
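
The six variables of an SSE pair can be computed from the end points of the two vectors. The text does not give formulas for the angles, so the Python sketch below rests on stated assumptions: BA_m is taken as the angle between V_m and the line from the mid-point of V_m to the mid-point of V_n, BA_n as the angle between V_n and the same line in the opposite direction, and DA(m, n) as the unsigned dihedral angle between the two vectors about that line. The names SSEP and ssep are hypothetical.

```python
import math
from typing import NamedTuple


def _sub(p, q):
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def _dot(p, q):
    return p[0] * q[0] + p[1] * q[1] + p[2] * q[2]


def _cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def _angle_deg(u, v):
    """Unsigned angle between two non-zero vectors, in degrees."""
    c = _dot(u, v) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))


class SSEP(NamedTuple):
    vl_m: float   # length of V_m
    vl_n: float   # length of V_n
    vd: float     # distance between the two mid-points
    ba_m: float   # assumed: angle between V_m and the mid-point line
    ba_n: float   # assumed: angle between V_n and the reversed mid-point line
    da: float     # assumed: unsigned dihedral angle about the mid-point line


def ssep(vs_m, ve_m, vs_n, ve_n) -> SSEP:
    """Compute the six variables from the end points of two SSE vectors.

    Degenerate cases (coinciding mid-points, or a vector parallel to the
    mid-point line) are not handled and raise ZeroDivisionError.
    """
    dir_m, dir_n = _sub(ve_m, vs_m), _sub(ve_n, vs_n)
    mid_m = tuple((s + e) / 2 for s, e in zip(vs_m, ve_m))
    mid_n = tuple((s + e) / 2 for s, e in zip(vs_n, ve_n))
    axis = _sub(mid_n, mid_m)
    return SSEP(
        vl_m=math.hypot(*dir_m),
        vl_n=math.hypot(*dir_n),
        vd=math.hypot(*axis),
        ba_m=_angle_deg(dir_m, axis),
        ba_n=_angle_deg(dir_n, _sub(mid_m, mid_n)),
        # Angle between the planes (V_m, axis) and (V_n, axis).
        da=_angle_deg(_cross(dir_m, axis), _cross(dir_n, axis)),
    )


pair = ssep((0, 0, 0), (2, 0, 0), (1, 3, -1), (1, 3, 1))
```

In this example the two vectors are perpendicular to each other and to the line joining their mid-points, so all three angles are right angles and the mid-point distance is 3.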




Fig. 12. Internal coordinates that represent spatial arrangements of a pair of vectors



4 Contributions and Utility of Our Semantic Model 

We believe our model makes significant contributions because it brings out the hid- 
den meaning behind the spatial arrangement of protein structures. This information is 
even more powerful when it is associated with the structure and formation of the 
bonding forces. Our model can facilitate easier data retrieval and analysis by the
scientific community in the following ways.

4.1 Database Integration 

Being able to use structure data effectively and efficiently is the core of structural
genomics [22], especially when there are a number of resources using different data
models. Ontologies are being developed in the field of bioinformatics to support data
integration [14]. Most of this work is aimed at describing DNA sequences, with
particular focus on their function and how they react with other biological entities.
Not much research has been directed at representing the structural complexity of
biological data. Our model can serve as a canonical model for unifying different
sources of protein structure data with all the semantics captured. This would save a
lot of time and human resources in data curation and enable one-stop shopping.

4.2 Revolutionize Structure Search and Comparison

Current protein structure data is stored in flat files that cannot be easily queried or
examined. Our model can represent the semantics of the data and facilitate development
of user-friendly tools to browse, search, and extract protein structures. For
instance, a model such as ours can be used to ask queries of the following type: “For a
specific protein with ID 1SFIA, find all of its motifs, and retrieve the structural
composition of each motif”.

Protein structure prediction can be improved, perhaps even sped up, and also
made easier by deploying our model. Instead of building various software tools and
developing algorithms to compare and search structures on top of flat file data, we
can explore the possibility of extending an object-relational DBMS with specialized
operators targeted at structure data. This approach is currently being explored by the
latest Oracle 10g database management system, which embeds data mining functionality
for classification, prediction and association mining. One of its new features is that
it incorporates the BLAST algorithm, supporting sequence matching and alignment
searches. But there is an obvious difference between the Oracle approach and our
work presented in this paper. In Oracle 10g, the sequence data is still recorded as a
text field. The BLAST operator is merely an interface based on flat files, without
capturing the semantics of sequence data.

As to the application of our proposed model, we can implement VAST or similar 
algorithms based on our semantic model into the DBMS. At the same time, new 
structure search and comparison operators can also be developed to extend SQL 
(Structured Query Language). For example, using our semantic model we capture the 
spatial composition of each secondary structure element in a protein. We can now 
define operators to compare the different SSEs based on their vector variables. Such a 
comparison operator can include thresholds for determining similarity. That is, if the 
vector length, vector angle, vector distance and vector dihedral angle comparison 
results are all within the corresponding thresholds then the SSEPs can be determined 
to be similar. Once such an operator is defined and implemented, we can extend it 
further to easily compare two 3D structures to find out what structures two proteins 
share in common and therefore determine the overall similarity between proteins. 
This will allow scientific users to easily query the data without having to learn 
advanced SQL operators and procedural languages. 
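
The threshold-based comparison described above can be sketched as a simple predicate over two SSE pairs. The Python fragment below is an illustration under assumptions, not the proposed SQL operator: the names SSEP, Thresholds and similar, the choice of one threshold per kind of variable, and all numeric values are hypothetical.

```python
from typing import NamedTuple


class SSEP(NamedTuple):
    """The six geometrical variables of a pair of SSE vectors."""
    vl_m: float   # length of the first vector
    vl_n: float   # length of the second vector
    vd: float     # distance between the mid-points
    ba_m: float   # first bond angle, in degrees
    ba_n: float   # second bond angle, in degrees
    da: float     # dihedral angle, in degrees


class Thresholds(NamedTuple):
    """Maximum allowed absolute difference for each kind of variable."""
    length: float
    distance: float
    angle: float
    dihedral: float


def similar(p: SSEP, q: SSEP, t: Thresholds) -> bool:
    """Two SSE pairs are similar if every variable differs by at most its threshold."""
    return (
        abs(p.vl_m - q.vl_m) <= t.length
        and abs(p.vl_n - q.vl_n) <= t.length
        and abs(p.vd - q.vd) <= t.distance
        and abs(p.ba_m - q.ba_m) <= t.angle
        and abs(p.ba_n - q.ba_n) <= t.angle
        and abs(p.da - q.da) <= t.dihedral
    )


# Illustrative values only.
limits = Thresholds(length=1.0, distance=1.5, angle=10.0, dihedral=15.0)
p = SSEP(12.0, 9.5, 8.0, 70.0, 110.0, 45.0)
q = SSEP(12.4, 9.0, 8.9, 75.0, 104.0, 50.0)
r = SSEP(12.4, 9.0, 8.9, 75.0, 104.0, 80.0)
```

Here p and q differ by less than every threshold, while r differs from p in the dihedral angle by 35 degrees and is therefore not considered similar.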

4.3 Utility in Other Fields 

Besides its application in bioinformatics, our semantic model can benefit other related 
fields such as Chemoinformatics [17]. Chemoinformatics is concerned with the appli- 
cation of computational methods to tackle chemical problems, with particular empha- 
sis on the manipulation of chemical structure information. Therefore, it is essential in 
Chemoinformatics that researchers have effective and efficient approaches to store, 
search, and compare large quantities of complicated 2D or 3D structures of various 
chemical compounds. We are planning to extend our model for this field.

5 Future Research

The number of protein structures that are determined or calculated is increasing
rapidly. Searching for and comparing structures is currently very complicated and has
become a major obstacle in the development of structural genomics.

Our work focuses on the semantics of bioinformatics data to understand and model 
3-D protein structures. With this model, we hope to pave the way for standard and 
useful software tools for performing protein structure search more effectively and 
efficiently. We are continuing to extend this model to make it more complete. For 
example, we are elaborating on the chemical forces that contribute to the formation of 
secondary structures beyond hydrogen bonds. Other semantics are also being ex- 
plored including the effects of solvent molecules. Based on our semantic model, a 
relational schema is being developed, and we are proposing new structure comparison 
and search operators. Even though implementation is not the focus of our research, 
we will be developing a prototype system as a proof-of-concept.

Abstract. In the current networked world, outsourcing of information technology
or even of entire business processes is often a prominent design alternative. In
the general case, outsourcing is the distribution of economically viable activities 
over a collection of networked organizations. To evaluate outsourcing decision 
alternatives, we need to make a conceptual model of each of them. However, in 
an outsourcing situation, many actors are involved that are reluctant to spend too 
many resources on exploring alternatives that are not known to be cost-effective. 
Moreover, the particular risks involved in a specific outsourcing decision have to 
be identified as early as possible to focus the decision-making process. In this 
paper, we present a risk-driven approach to conceptual modeling of outsourcing 
decision alternatives, in which we model just enough of each alternative to be 
able to make the decision. We illustrate our approach with an example. 



1 Introduction 

Current network technology reduces the cost of outsourcing automated tasks to such 
an extent that it is often cheaper to outsource the automated task than to perform it 
in-house. Automation decisions thereby become outsourcing decisions. In the simplest 
case, outsourcing is the delegation of value activities from one organization to another, 
but in the general case, it is the distribution of a set of value activities over a
collection of networked organizations. Organizations involved in negotiating about
outsourcing must know as early as possible whether some allocation of activities to organizations
is profitable. This precludes them from the costly modeling of functionality, data, be- 
havior, communication structure and quality attributes of each possible alternative. To 
reduce this cost, a just-enough approach to conceptual modeling of possible solutions 
is needed, which allows a selection among alternatives without elaborate conceptual 
models of each. 

In this paper, we present a risk-driven approach to conceptual modeling of alternatives
in outsourcing decisions. The main advantage of our approach is that it provides,
with relatively little effort, a problem structuring of the outsourcing decision. The
approach itself involves only very simple diagramming techniques, just enough to capture
the structure of the problem while simple enough to present to stakeholders who do not
have a background in conceptual modelling. The approach helps in identifying the parts
of the problem for which we need to develop more detailed conceptual models using
well-known techniques such as those provided by the UML.

We illustrate our approach by a case study that is introduced in Section 2. The first
step in our approach is based on the e³-value method, which is presented in Section 3.
The e³-value method, which has been developed earlier by the third author, can be used
to model and evaluate innovative business models for networked businesses (in this
paper: outsourcing options) from the perspective of value created by each participant
in a networked business. The e³-value method is based on accepted conceptual notions from
marketing, business science and axiology. It deliberately limits the number of modeling
constructs such that the method is easy to learn and apply in practice. Using e³-value,
outsourcing decision alternatives for the case study are developed in Section 4. In
Section 5, we present a systematic approach to identify risks associated with each option,
which is applied to our case study and discussed in Section 6. Section 7 concludes the
paper.

2 The NGO Example 

We illustrate our approach with a real-life example of a collection of European Non- 
Governmental Organizations (NGOs) in the domain of international voluntary service. 
Each NGO sends out volunteers from its own country to projects offered by NGOs 
in other countries (as well as to its own projects) and accepts volunteers from other 
countries in its own projects. The purpose is to create possibilities for learning from other
cultures and to help in local social development. The NGOs maintain contact with each 
other about projects offered and about volunteers, and there is a supranational umbrella 
organization that loosely coordinates the work of the (independent) NGOs. Some of the 
NGOs receive government subsidies, most do not. In the projects offered, only work is 
done that cannot be performed commercially. 

Each NGO has a web site, a general ledger system for the financial administration, 
a simple workflow management system (WFM) to manage the workflow for match- 
ing each volunteer to a project, a project database of running projects, and a customer 
relationship management system (CRM) to manage information about volunteers that 
have shown interest in voluntary service. Since the NGOs vary widely in age, size and 
level of professionalism, and since they are independent, the implementations of these 
systems also vary widely and do not provide compatible interfaces. Recently, an ap- 
plication service provider (ASP) has offered to handle the WFM/CRM systems of all 
NGOs. The question to be solved is how this can be done such that the ASP makes 
money, while the NGOs are better off in terms of costs, quality, or both and the risks 
associated with the outsourcing solution chosen are manageable.

3 The e³-value Method

The e³-value methodology is specifically targeted at the design of business networks,
as for example in e-commerce and e-business. Business networks jointly produce,
distribute and consume things of economic value. The rapid spread of business networks,
and of large enterprises that organize themselves as networks of profit-and-loss
responsible business units, is enabled by the capability to interconnect information
systems of various businesses and business units. In all cases, an application of
e³-value is triggered by the networking opportunities perceived to be offered by
information and communication technology (ICT). The use of e³-value is then to explore
whether the networking idea can really be made profitable for all actors involved. We do
so by thoroughly conceptualizing and analyzing such a networked idea, to increase shared
understanding of the idea by all stakeholders involved. The results of an e³-value track
are sufficiently clear to start requirements engineering for software systems. In the
following, we will indicate networks of businesses and networks of business units by the
blanket term networked enterprises. We will also call the software systems that support
business processes business systems. Examples of business systems are information
systems, workflow management systems and enterprise-specific application software.

Before the requirements on the information technology used by networked enter- 
prises can be understood, the goals of the network itself need to be understood. More 
precisely, before specifying the business systems and communications between these, 
it is important to understand how various enterprises in the network create, distribute 
and consume objects of economic value. The e³-value method has been developed in
a number of action research projects as a method to determine the economic structure 
of a networked enterprise. These are real life projects in which the researcher uses the 
technique together with business partners, followed by a reflection on and improvement 
of the technique. For the business partners, these projects are not research but commer- 
cial projects where they pay for the results. The researcher has the dual aim to do a job 
for the business and to learn something from doing so. 

We illustrate the concepts of e³-value using Fig. 1, which shows a value model of
the current network of NGOs. 

Actor. An actor is perceived by its environment as an independent economic (and often
also legal) entity. An actor intends to make a profit or to provide a non-profit service.
In a sound, sustainable business model each actor should be capable of creating a net
value. Commercial actors should be able to make a profit, and non-profit actors should
be able to create a value that in monetary terms exceeds the costs of producing it in
order to sustain themselves. Each NGO, the umbrella organization, each project and each
volunteer in our example is an actor.

Fig. 1. Value Model of the current NGO network.

Although this example is about non-profit organizations, the arguments to enter the 
network of cooperating NGOs are stated in terms of value added and costs saved by 
the members in this cooperation. This makes e³-value a useful technique to solve the
business problem whether a cooperation can be organized in such a way that value is 
added for all concerned. 

Value Object. Actors exchange value objects, which are services, products, money, or 
even consumer experiences. A value object is of value to at least one actor. In Fig. 1, 
Assigned volunteer and Assigned project are value objects. 

Value Port. An actor uses a value port to show to its environment that it wants to provide 
or request value objects. A value port has a direction, namely outbound (e.g. a service 
provision) or inbound (e.g. a service consumption). A value port is represented by a 
small arrowhead that represents its direction. 

Value Transfer. A value transfer connects two equidirectional value ports of different 
actors with each other. It is one or more potential trades of value objects between these 
value ports. A value transfer is represented by a line connecting two value ports. Note 
that a value transfer may be implemented by a complex business interaction containing 
data transmissions in both directions [1]. The direction of a value transfer is precisely 
that: the direction in which value is transferred, not the direction of data communications
underlying this transfer. 

Value exchange. Value transfers come in economic reciprocal pairs, which are called 
value exchanges. This models ‘one good turn deserves another’: you offer something to
someone else only if you get adequate compensation for it. 

Value Interface. A value interface consists of ingoing and outgoing ports of an actor. 
Grouping of ingoing and outgoing ports models economic reciprocity: an object is
delivered via a port, and another object is expected in return. An actor has one or more value
interfaces, each modelling different objects offered and reciprocal objects requested in 
return. The exchange of value objects across one value interface is atomic. A value 
interface is represented by an ellipsed rectangle. 

Market segment. A market segment is a set of actors that, for one or more of their value 
interfaces, ascribe value to objects in the same way from an economic perspective. 
Naturally, this is a simplification of the real world, but choosing the right simplifications 
is exactly what modeling is about. A market segment is represented by a stack of actor 
symbols. NGOs is an example of such a market segment. 

With the concepts introduced so far, we can describe who exchanges values with whom. 
If we include the end consumer as one business actor, we would like to show all value 
exchanges triggered by the occurrence of one end-consumer need. This considerably 
enhances a shared understanding of the networked enterprise idea by all stakeholders. 
In addition, to assess the profitability of the networked enterprise, we would like to do 
profitability computations. But to do that, we must count the number of value exchanges 
triggered by one consumer need. To represent an end-consumer need and do profitability
computations, we include in the value model a representation of dependency paths between
value exchanges. A dependency path connects value interfaces in an actor and
represents triggering relations between these interfaces. A dependency path has a
direction. It consists of dependency nodes and connections.

Dependency node. A dependency node is a stimulus (represented by a bullet), an AND-fork
or AND-join (short line), an OR-fork or OR-join (triangle), or an end node (bull’s
eye). As explained below, a stimulus represents a trigger for the exchange of economic
value objects, and an end node represents a model boundary.

Dependency connection. A dependency connection connects dependency nodes and 
value interfaces. It is represented by a link. 

Dependency path. A dependency path is a set of connected dependency nodes and con- 
nections with the same direction, that leads from one value interface to other value 
interfaces or end nodes of the same actor. The meaning of the path is that if a value 
exchange occurs across a value interface I , then value interfaces pointed to by the path 
that starts at interface I are triggered according to the and/or logic of the dependency 
path. If a branch of the path points to an end node, then this says “don’t care”. 

Dependency paths allow one to reason about a network as follows: When an end con- 
sumer generates a stimulus, this triggers a number of value interfaces of the consumer 
as indicated by the dependency path starting from the triggering bullet inside the con- 
sumer. These value interfaces are connected to value interfaces of other actors by value 
exchanges, and so these other value interfaces are triggered too. This in turn triggers 
more value interfaces as indicated by dependency paths inside those actors, and so on. 

Our value model now represents two kinds of coordination requirements: Value 
exchanges represent the need to coordinate two actors in their exchange of a value 
object, and dependency paths indicate the need for internal coordination in an actor. 
When an actor exchanges value across one interface, it must exchange value across all 
value interfaces connected to this interface. This allows us to trace the value activities 
and value exchanges in the network triggered by a consumer need, and it also allows 
us to estimate profitability of responding to this need in this way for each actor. For 
each actor we can compute the net value of the value objects flowing in and those 
flowing out according to the dependency path. The concept of a dependency path is 
reminiscent of use case maps [2], but it has a different meaning. A use case
map represents a sequential scenario. Dependency paths represent coordination of value
interfaces, and dependency paths in different actors need not have an obvious temporal
ordering relative to each other, even if triggered by the same stimulus.
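
The triggering semantics described above can be illustrated with a small sketch. The Python fragment below is not part of the e3-value tooling; it assumes a toy network in which every dependency path is a plain AND-fork, and all interface names and fee amounts are hypothetical. Starting from one stimulus, it follows value exchanges and dependency paths and then sums, per actor, the net monetary value of the triggered value interfaces.

```python
from collections import defaultdict, deque

# Hypothetical toy network (names and amounts are illustrative only).
# value interface -> (owning actor, net monetary value to that actor per trigger)
INTERFACES = {
    "volunteer.match": ("Volunteer", -50),   # volunteer pays a fee
    "ngo.volunteer_side": ("NGO", 50),       # NGO receives the volunteer's fee
    "ngo.project_side": ("NGO", 80),         # NGO receives the project's fee
    "project.match": ("Project", -80),       # project pays a fee
}

# Value exchanges connect value interfaces of two different actors.
EXCHANGES = [
    ("volunteer.match", "ngo.volunteer_side"),
    ("ngo.project_side", "project.match"),
]

# Dependency paths inside one actor, simplified to AND-forks:
# triggering the key interface triggers every interface in the list.
DEPENDENCIES = {
    "ngo.volunteer_side": ["ngo.project_side"],
}


def triggered_interfaces(start):
    """Return all value interfaces triggered by a stimulus at `start`."""
    peers = defaultdict(list)
    for a, b in EXCHANGES:
        peers[a].append(b)
        peers[b].append(a)
    seen = set()
    queue = deque([start])
    while queue:
        iface = queue.popleft()
        if iface in seen:
            continue
        seen.add(iface)
        queue.extend(peers[iface])                 # other side of each value exchange
        queue.extend(DEPENDENCIES.get(iface, []))  # internal coordination in the actor
    return seen


def net_value_per_actor(interfaces):
    """Sum the net value of the triggered interfaces for each actor."""
    totals = defaultdict(int)
    for iface in interfaces:
        actor, value = INTERFACES[iface]
        totals[actor] += value
    return dict(totals)


triggered = triggered_interfaces("volunteer.match")
print(net_value_per_actor(triggered))
```

A fuller treatment would also have to handle OR-forks, joins and occurrence counts; the sketch only shows how a single consumer need fans out over the network.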



4 Example: Outsourcing Options for the NGOs

4.1 Current Value Model 

In order to explore possibilities for outsourcing, we first discuss the current value model 
of the NGOs as presented in Fig. 1 . The diagram shows the NGO market segment twice, 
because we want to show that there exists interaction between NGOs. An NGO serves 
two types of actors: volunteers and projects. The task of an NGO is to match a volunteer
to a project. If a match is successful, the project obtains a volunteer, and the volunteer
obtains a project. Both the volunteer and the project pay a fee for this service.
Volunteers need a project to work for; projects need volunteers. These needs are shown
in Fig. 1 by stimuli. 

The match itself is represented as an AND-join. Following the paths connected to 
the join, it can be seen that for a match, a volunteer and a project are needed. These
volunteers and projects can be obtained from the NGO’s own customer base, or can be
obtained from other NGOs, as is represented by OR-joins.

Note that Fig. 1 shows only part of the dependency path. Specifically we represent 
that for matching purposes the rightmost NGO uses volunteers and projects from its 
own base or from other NGOs. However, the leftmost NGOs also do matching. Paths
associated with these matchings are not presented. We skip the profitability estimations 
for this example, because these play no role in the following argument. The interested 
reader can find examples in earlier publications [3,4]. The e3-value method includes
tools to check well-formedness of models and to perform profitability analysis. 

4.2 Option (1): ICT Outsourcing 

A main concern for NGOs and the umbrella organization is to have cost-effective ICT 
support for their processes, while preserving or improving the quality of service offered 
to volunteers. Specifically, NGOs have indicated that the different WFM and CRM 
systems present in the NGOs are candidates for cost-cutting operations. We saw in our 
current problem analysis that each NGO exploits its own WFM and CRM. One option 
for cost-cutting is therefore to replace all these WFM and CRM systems by one system, 
to be used by all NGOs. This system can be placed at the umbrella organization, which
then acts as an Application Service Provider (ASP). This means that NGOs connect to 
the Internet, and use the WFM and CRM system owned by the umbrella organization. 
To keep costs low, NGOs use a browser to interact with the WFM and CRM system of 
the umbrella. This leads to the value model in Fig. 2. 

The exchanges introduced in Fig. 1 remain intact. The umbrella organization acting 
as ASP is introduced in the value model. In the value model we see that the ASP offers
a matching service, i.e. the ASP offers an information system with the same main
functionality as the old WFM and CRM application. Each NGO still has to perform the
matching function (using the information system offered by the ASP). Thus, this is a
case of IT outsourcing but not of business process outsourcing (BPO). This implies that,
from a value perspective, the NGO interacts exactly as in Fig. 1.

4.3 Option (2): Business Process Outsourcing 

A second option is to outsource the matching function itself to the umbrella organization
(business process outsourcing, which includes ICT outsourcing). Fig. 3 shows the
value model of this. The matching is now done for all NGOs using the same base of
volunteers and projects. This allows for global matching, rather than local
matching for each NGO separately. In this solution, there is a drastic change in the
value exchanges: Each NGO pays the umbrella organization for a match. The role
of an NGO is not so much the matching itself, but attracting volunteers and projects
in its specific region. So, exchanges between NGOs disappear. They now exchange
value objects using the umbrella organization as an intermediary.

5 Concerns and Risks 

In order to implement a value model, we need to model business processes, information 
manipulated by these processes, and other aspects of the technology support of the 
model. To prevent us from spending a lot of time on models that will not be used after 
an outsourcing option is chosen, we identify current business goals that will be used to 
discriminate between different options. The goals are identified by listing current business issues,
as illustrated in Table 1. This table is explained later. Each outsourcing option will be
evaluated with respect to these goals. 

Furthermore, we will use a concern matrix, that lists all relevant system aspects that 
we possibly would want to model, and set this off against the major cost factor of each 
option, namely maintenance. Table 2 shows a concern matrix, that we will discuss later. 
We use a concern matrix to identify the risks associated with each option, where a risk
is the likelihood of a bad consequence, combined with the severity of that consequence. 
Each cell in the concern matrix is evaluated by asking (i) what the risk is that this option 
cannot be realized, and (ii) what the risk is that the option under consideration will not 
achieve the business goals in this area. 

The concern matrix allows us to reduce conceptual modeling costs in two ways. 
First, it prevents us from modeling in detail options that will not be chosen, and second, 
for the chosen option it will point us at aspects that need not be modelled because no 
risk is associated with them. Use of the issue/goal list and of the concern matrix are 
two tools in our method engineering approach to conceptual modeling. They allow us 
to select system aspects for which we will make conceptual models. 
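
As a rough illustration of how such a matrix can steer modeling effort, the following Python sketch stores a few cells keyed by (maintenance type, system aspect) and selects the aspects whose risk exceeds a threshold. The numeric scores, the threshold, and the choice to combine likelihood and severity by multiplication are assumptions made for this example only; the paper itself assesses the cells qualitatively.

```python
# Dimensions of the concern matrix as named in the text.
ASPECTS = ["services", "data", "behaviour", "communication", "composition"]
MAINTENANCE = ["functional", "application", "infrastructure"]

# (maintenance type, aspect) -> (likelihood 1..5, severity 1..5).
# All scores are hypothetical placeholders, not assessments from the case.
concerns = {
    ("functional", "services"): (3, 3),
    ("functional", "behaviour"): (2, 5),
    ("application", "communication"): (4, 4),
    ("infrastructure", "data"): (1, 1),
}


def risk(likelihood, severity):
    """Combine likelihood and severity; multiplication is an assumed rule."""
    return likelihood * severity


def aspects_to_model(matrix, threshold):
    """Return the aspects that have at least one cell at or above the threshold."""
    return sorted(
        {aspect for (_, aspect), (p, s) in matrix.items() if risk(p, s) >= threshold}
    )


print(aspects_to_model(concerns, threshold=5))
```

Aspects that never reach the threshold (here, data) would be left unmodelled, which mirrors the second cost saving mentioned above.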
Fig. 3. Value Model for a business process outsourcing solution.



We now explain the two dimensions of the concern matrix in more detail. The horizontal
dimension of the matrix distinguishes five general aspects of any system. The
universality of these aspects is motivated extensively in earlier publications [5, 6]. The
relevance of these aspects follows from the fact that any specification of a system to
be outsourced must specify these aspects. Note that in this paper, a system equals an
outsourcing option, i.e., outsourced ICT and/or business processes together with their
context in an organization. We now briefly explain the system aspects.

- The services (or functions) provided by the system;

- The data (or information) processed and provided by the system;

- The behaviour of the system: the temporal order of interactions during delivery of
these services;

- Communication: the communication channels through which the system interacts
with other systems during service delivery;

- The composition of the system in terms of subsystems.

Our earlier publications [5, 6] distinguish a sixth aspect: the non-functional or qual- 
ity aspect. In this paper, this aspect consists of attributes of the other aspects. 

The vertical dimension of our concern matrix consists of several types of mainte- 
nance. Maintenance in this paper is defined as all activities that need to be performed 
to manage, control and maintain the ICT systems and procedures of an organization. 
Maintenance in this sense is also called IT service management. We need to consider 
maintenance because it accounts for most of the entire system cost, and
therefore contains most of the risk of an outsourcing option. By the same token, design
and implementation of ICT systems (i.e., the work done by software engineers) is
only a small part of the entire system cost and therefore contains only a small part of
the risk of a design alternative. In addition, in the context of outsourcing, design and
implementation are even less relevant: If existing ICT systems are outsourced, design
and implementation have already been completed; if new business processes are out- 
sourced, design and implementation of ICT systems that support these processes is the 
responsibility of the organization to which these processes are outsourced. 

The maintenance dimension distinguishes three kinds of maintenance, namely functional,
application and infrastructure [7], explained next.

- Functional maintenance consists both of maintenance of the set of services that an 
information system provides, as well as of supporting users in getting the most out 
of the set of services offered. This involves providing some form of helpdesk and 
user handholding, but also personnel and procedures to collect user requirements 
and turn them into a specification of required services. Functional maintenance is a 
responsibility of the user organization and is often performed by non-IT personnel.
Some of the users are partially freed from their normal duties and instead are given 
the task to help other users, perform acceptance tests, etc. 

- Application maintenance consists of maintenance of the software that implements 
an information system (as well as, to a lesser extent, user support, e.g. providing a 
third-line helpdesk). Application maintenance is carried out mostly by IT personnel,
specifically programmers. Tasks include fixing bugs, implementing new functions, and
version and release management. ASL (Application Service Library) is a standard
process model for application maintenance [8].

- Infrastructure maintenance comprises all tasks needed to provide the computer and 
networking infrastructure needed for the information systems of an organisation to 
run: configuration management, capacity management, incident management (including
user support). ITIL (IT Infrastructure Library) is a standard process model
for infrastructure maintenance [9].

The maintenance dimension contains maintenance aspects of the outsourcing options,
not of maintenance itself. This means for instance that the behaviour aspect is concerned
with the processes an outsourced system executes and not with processes needed
for maintenance such as those described by ASL and ITIL.



Table 1. NGO ICT policy goals and the issues they address.

| Issue | Goal |
| Low quality of end-user support (personnel in primary processes) | G1 Improve end-user support |
| Functionality inadequate due to long application release turnarounds: (a) ad-hoc queries; (b) information exchange with other applications | Improve speed of adding new functionality: G2a Ad-hoc queries can be performed with short turnaround time. G2b Information exchange functions with other applications can be provided in months, not years. |
| Data pollution | G3 Improve means for cleaning up data, either by end users or by database administrator |
| Possible future European consolidation | G4 Robustness w.r.t. future changes in fundamental way of working |

Table 2. Concerns matrix for NGO. Changes in risks as opposed to the current situation. All
entries name issues that lead to increased risk unless indicated otherwise.

| | Services | Data | Behaviour | Communication | Composition |
| Functional maintenance | Options 1 & 2: helpdesk quality (G1) | Options 1 & 2: ad hoc queries (G2a) | Option 1: none; Option 2: fundamental change in way of working. Both: G4 | - | - |
| Application maintenance | Options 1 & 2: quality and availability of adaptive and corrective maintenance (G2b) | Options 1 & 2: DBA availability and quality (G3) | - | Options 1 & 2: interfaces with systems that are not outsourced (G2b) | - |
| Infrastructure maintenance | Options 1 & 2: ASP network connection quality | Options 1 & 2: no DBMS needed (decreased risk) | - | - | - |

6 Example: NGO Outsourcing Concerns 

Table 1 lists the goals with respect to which we will decide what concerns us in the
outsourcing options. For each cell in the matrix, we ask whether it will help bring us
closer to the goals, and what the risk is that it takes us farther away from the goals. The
resulting concern matrix is shown in Table 2. We now discuss the columns of the matrix.

6.1 The Behaviour Aspect

The current business processes operating in the NGOs are these:

• Core processes
  - acquisition of own projects
  - acquisition of own volunteers
  - matching:
    - placement of incoming volunteers in own projects
    - placement of own volunteers in projects of other NGOs
  - volunteer preparation (training)
• Management of the network of NGOs
  - Entry and exit of an NGO in the NGO network
• Financial processes
• ICT support processes
• ICT management processes
• HRM processes
• Controlling processes
  - policy making
  - quality control
  - incident response


At this moment, there is no need to elaborate this simple conceptual model of business 
processes, because we can already see what the issue is. Option (1), the ASP solution,
has no impact on the business processes. However, in option (2), the BPO solution, one
of the core processes (matching) no longer has to be executed. We can now ask the 
NGOs to decide whether this is good (more time to focus on project and volunteer 
acquisition and preparation) or bad (loss of strategic advantage). 

Looking at our list of goals (Table 1), we see that a second question to the NGOs is
whether Options 1 and 2 facilitate possible future European consolidation.

6.2 The Communication Aspect

Fig. 4 shows the currently available business systems in each NGO. The figure shows a 
number of business systems in NGO1. Each system consists of people and technology,
such as software, paper, telephones, fax, etc. Each NGO has systems with the same 
functionality, but different NGOs may use different people-technology combinations 
to implement this functionality. The diagram also shows a number of communication 
channels between these systems. We have labeled them to be able to refer to them below. 
Each communication channel is a means to share information between two actors. The 
meaning of the diagram is that each channel is reliable and instantaneous; if we want to 
include an unreliable communication channel with delays, we should include the channel
as a system in the diagram and connect it with lines to the systems communicating
through the channel; the remaining lines then represent reliable instantaneous communication
channels. The WFM system of NGO1 communicates with WFM systems in all
other NGOs through channel E. Not shown is the fact that the communication between
WFMs of different NGOs currently is done mostly by telephone, fax, email and paper
mail. Fig. 4 also shows the context of NGO1, which consists of volunteers and projects
(and other NGOs).

Option (1), the ASP solution, impacts the technology situation, as shown in the con- 
text diagram of Fig. 5(a). From an ICT perspective, there is now only one WFM/CRM 
application (instead of many different ones), but there are still as many instances of it 
as there were applications in the old situation, only they are now provided by one party, 
and they are all exactly the same. By doing so, the umbrella organization can exploit 
economies of scale and thus yield a more cost-effective ICT service for the NGOs. In- 
terface E is now simplified because it is an interface between different instances of the 
same system. However, the other interfaces need to be redesigned, as the WFM/CRM 
application offered by the ASP is most probably different from the one the NGO used 
before. This means that either the ASP or each NGO has to manage integration mid- 
dleware. Cross-organizational integration of enterprise applications is a relatively new 
phenomenon and is known to be complicated. Thus, the need for this technology adds 
considerable risk to outsourcing. The NGOs use the WFM/CRM applications exactly 
in the same way as before; their business processes do not have to change. 

From an ICT perspective, option (2), the business process outsourcing solution,
has one matching (WFM/CRM) system for all NGOs (Fig. 5(b)). Interface E now disappears.
However, as in the ICT outsourcing option, the other interfaces have to be
adapted.

Fig. 4. Communication diagram of current situation.

Fig. 5. Communication diagrams of software systems in the ASP and the outsourcing solutions: (a) ASP solution; (b) Outsourcing solution.

6.3 The Services Aspect 

Functional support/maintenance. The question in both outsourcing options is how 
user support (‘handholding’) is organised. One possibility is that the ASP provides a 
first-line helpdesk, either as part of a package deal (fixed price), or billed per incident.
In this case, each NGO has to ask itself whether it thinks the ASP knows enough of this
NGO to actually be able to understand user questions and respond to them in a helpful
way (language is an issue here as well). For the ASP, this helpdesk is a new value 
object that can be added to the value models in Fig. 2 and Fig. 3 to get a more complete 
model. If the ASP does not offer a first-line helpdesk, each NGO has to appoint someone 
(most probably a ‘power user’) to provide support for other users. This person is then
supported by a second-line helpdesk provided by the ASP. 

Application support/maintenance. In application maintenance, often a distinction is 
made between corrective maintenance (fixing bugs, no new functions) and adaptive 
maintenance (implementing new user needs by adapting or building new functions). 

- Corrective maintenance is equivalent to fixing bugs. It can be expected that in
both options, the ASP is responsible for this. The NGO needs to convince itself 
that the ASP is up to this task. 

- Adaptive maintenance. It can be expected that each NGO from time to time needs
new functions. The ASP may provide a service that consists of building new functions
in the application provided, for instance billed by the hour or at a fixed, pre-negotiated
price (this is a value object that the ASP may offer). The ASP may
also use a collaborative, open-source based method. The ASP may also not offer
adaptive maintenance. In this case all added functionality has to be implemented
by the NGOs themselves, outside of the application offered, which has implications
for interfaces A-D.

Infrastructure support/maintenance. Infrastructure support/maintenance does not 
change significantly for the NGOs if the CRM/WFM application is outsourced to the 
ASP. Each NGO still needs to provide a local infrastructure to its personnel consisting
of workstations, a local area network, operating systems and personal productivity soft- 
ware. If the CRM/WFM application is outsourced, the NGO no longer needs to provide 
e.g. a database or application server (assuming it was only used for the CRM/WFM 
application), but maintenance of the Internet uplink becomes more important as it is a 
single point of failure: If it is unavailable, the outsourced application cannot be used. 

6.4 The Data Aspect 

Functional support/maintenance. The issue here is ad-hoc queries. From time to
time, each NGO may want to do some one-time analysis of its data. (A realistic example 
is checking whether the NGO qualifies for a certain form of subsidy, e.g. related to the 
average age of its volunteers.) Strictly speaking, this belongs to the function aspect, as 
it requires a new function. In practice, however, it is not possible to treat a one-time 
analysis as a new function: there is not enough time to wait for a new release. The 
ASP may offer a kind of extended database administration service that can run ad-hoc
queries, or provide data-level access to the data sets to the NGOs, which would require a
new interface next to A-D.

Application support/maintenance 

- Information aspect, corrective maintenance. It is widely known that each and every 
data set sooner or later gets polluted with incorrect data. It may be the case that the 
application offered by the ASP provides a set of functions that enables end users 
to always manipulate all data, no matter what happened to it. It is perhaps more 
realistic to assume that every now and then, a database administrator is needed 
to correct things at the database level, either because no function is available for 
certain corrective actions, or it is more efficient (bulk updates). The ASP may offer 
a database administrator, either as part of a package deal (fixed price), or billed 
by the hour. The ASP may also decide to offer access at the database level to the 
NGOs, or both. For each NGO, this means that it has to decide whether to perform 
database maintenance itself, or buy it from the ASP. 

- Information aspect, adaptive maintenance. This refers to changing the database
schema and most often also requires adapting existing functions to do something
useful with the new schema. Therefore, the same considerations hold as for the
function aspect, adaptive maintenance.

Infrastructure support/maintenance. The NGO is able to save some costs as a database
management system for the WFM and CRM applications is no longer needed.

6.5 Discussion 

The concern matrix identifies issues to be taken into consideration when choosing be- 
tween options. It also identifies aspects of the chosen option to be elaborated in concep- 
tual models. So it saves us work in two ways: (1) It prevents us from detailed conceptual 
modeling when choosing options and (2) it prevents us from modeling all aspects of an 
option once chosen. 

Identification of possible outsourcing options (Section 4) and the risks associated 
with them (Section 6) enabled the NGO to focus its internal decision process as well as 
discussions with the ASP provider. So far, none of the options have been found to be 
unacceptable. The next step for the NGO is to further elaborate the options by designing 
high-level models of support processes and estimating the costs associated with them. 
Moreover, the NGO needs to look deeper into the interfaces needed with the outsourced 
application. This will involve modelling the data exchanged to get an idea of the effort 
needed to design these interfaces. The ASP may, based on discussions with the NGOs, 
further extend its offerings, which can be modelled with additional e3-value models.

7 Conclusions 

We presented an approach to quickly identify alternatives for outsourcing decisions and 
the risks associated with them, using a few simple diagramming techniques (value mod- 
els, a bulleted list as a process model, and communication diagrams). The main value 
of this approach is that it provides, with relatively little effort, insight into the struc- 
ture of the outsourcing problem at hand. This insight is needed to identify the parts of 
the problem that warrant more detailed conceptual modelling efforts using well-known 
techniques such as entity-relationship modelling. The problem structure also quickly 
reveals enterprise application integration (EAI) problems introduced by outsourcing. 

We plan to further develop our value-based approach to design and analysis of e- 
business systems. This involves for instance systematic ways of deriving business pro- 
cesses from a value model [1]. Furthermore, we plan to investigate the relation between 
our approach and Quality Function Deployment (QFD/House of Quality) [10]. QFD 
provides a systematic way to compare alternative solutions to a design problem with 
respect to quality attributes. We think that our approach can be used to identify which 
quality attributes are important in a given outsourcing problem, as well as to identify 
alternative solutions. In this way, QFD may be usable as an extension to our approach. 
Abstract. In this paper an approach for building process models for e-commerce
is proposed. It is based on the assumption that the process modeling task
can be methodologically supported by a designers assistant. Such a foundation 
provides justifications, expressible in business terms, for design decisions made 
in process modeling, thereby facilitating communication between systems de- 
signers and business users. Two techniques are utilized in the designers assis- 
tant, namely process patterns and action dependencies. A process pattern is a 
generic template for a set of interrelated activities between two agents, while an 
action dependency expresses a sequential relationship between two activities. 



1 Introduction 

Conceptual models have become important tools for designing and managing com- 
plex, distributed and heterogeneous systems, e.g. in e-business and e-commerce, [2, 
17]. In e-commerce it is possible to identify two basic types of conceptual models: 
business models and process models. A business model focuses on the what in an e- 
commerce system, identifying agents, resources, and exchanges of resources between 
agents. Thus, a business model provides a high-level view of the activities taking 
place in e-commerce. A process model, on the other hand, focuses on the how in an e- 
commerce system, specifying operational and procedural aspects of business commu- 
nication. The process model moves into a more detailed view on the choreography of 
the activities carried out by agents. 

A business model has a clearly declarative form and is expressed in terms that can 
be easily understood by business users. Therefore, business models function well for 
supporting communication between systems designers and business users. In contrast, 
a process model has a more procedural form and is at least partially expressed in 
terms, like sequence flows and gateways, that are not immediately familiar to busi- 
ness users. Furthermore, it is often difficult to understand why a process model has 
been designed in a certain way and what consequences alternative designs would 
have. In order to overcome these limitations, we believe that process models should 
be complemented by and be based on a more declarative foundation. Such a founda- 
tion would provide justifications, expressible in business terms, for design decisions 
made in process modeling, thereby facilitating communication between systems designers
and business users. In this paper, we propose a designers assistant that provides
a declarative foundation for process modeling and suggests a method for gathering
domain knowledge. The work reported in this paper extends the work of [1] and [10]
in that we propose two instruments for a declarative foundation of process models: 
process patterns and action dependencies. A process pattern is a generic template for a
set of interrelated activities between two agents, while an action dependency expresses
a sequential relationship between two actions.

The rest of the paper is organized as follows. Section 2 presents the notions of 
business models and process models. Section 3 introduces process patterns and makes 
a distinction between transaction patterns and collaboration patterns. Section 4 dis- 
cusses action dependencies. Section 5 proposes a designers assistant that supports a 
designer in the construction of a process model. Section 6 concludes the paper and 
gives suggestions for further research. 



2 Business Models and Process Models 

For illustrating business and process models, a small running case is introduced. It is a 
simplified version of the Drop-Dead Order business case described in [8]. In this 
business scenario, a Customer requests an amount of fried chicken from a Distributor. 
The Distributor then requests formal offers from a Chicken Supplier and a Carrier. 
Furthermore, the Distributor requests a down payment from the Customer before 
accepting the offers from the Chicken Supplier and Carrier. As the Customer com- 
pletes the down payment to the Distributor, the Distributor accepts the offer from the 
Chicken Supplier by also paying a down payment and the offer from Carrier. When 
the Chicken Supplier has provided the fried chicken and the Carrier has delivered 
them to the Customer, the Distributor has thereby fulfilled the Customer’s order. 
After that, the Customer settles the final payment to the Distributor. Finally, the Dis- 
tributor settles the Chicken Supplier’s final payment and the payment for the Carrier. 

2.1 Business Models 

As a foundation for business models, we will use the REA ontology [13], which has 
been widely used for business modeling in e-Commerce, [17]. The REA framework is 
based on three main components: Resources, Economic Events, and Agents, see Fig.
1 (footnote 1). An Agent is a person or organization that is capable of controlling Resources and
interacting with other Agents. A Resource is a commodity, e.g. goods or services, that
is viewed as being valuable by Agents. An Economic Event is the transfer of control
of a Resource from one Agent to another one. Each Economic Event has a counterpart,
i.e. another Economic Event that is performed in return, thereby realizing an exchange.
For instance, the counterpart of a delivery of goods may be the payment for
the same goods. This connection between Economic Events is modeled through the 
relationship Duality. 

Furthermore, a Commitment is a promise to execute a future Economic Event, for 
example fulfilling an order by making a delivery. The Duality between Economic 
Events is inherited by the Commitments, where it is represented by the association 
Reciprocal. In order to represent collections of related Commitments, the concept of 
Contract is used. A Contract is an aggregation of two or more reciprocal Commitments.
An example of a Contract is a purchase order composed of one or several order lines,
each one representing two Commitments (the goods to be delivered and the money to be
paid for the goods, respectively).

1 Due to space restrictions and for the purpose of readability we use abbreviated forms of the
terms in the original REA ontology. This is done by dropping the term ‘Economic’ for
Economic Contract, Economic Commitment, Economic Resource, and Economic Agent.
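
The purchase order example can be sketched in code. The following Python fragment is only an illustration of the REA notions of Commitment and Contract as described above; the class names, fields and the `add_order_line` and `is_reciprocal` helpers are assumptions made for this sketch and are not part of the REA ontology itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commitment:
    """A promise to execute a future Economic Event (hypothetical encoding)."""
    resource: str
    from_agent: str
    to_agent: str


@dataclass
class Contract:
    """An aggregation of reciprocal Commitments (hypothetical encoding)."""
    name: str
    commitments: list = field(default_factory=list)

    def add_order_line(self, goods, payment, seller, buyer):
        # One order line yields two Commitments: goods one way, money the other.
        self.commitments.append(Commitment(goods, seller, buyer))
        self.commitments.append(Commitment(payment, buyer, seller))

    def is_reciprocal(self):
        # Every direction of transfer must be matched by the opposite direction.
        directions = {(c.from_agent, c.to_agent) for c in self.commitments}
        return len(self.commitments) >= 2 and all(
            (to, frm) in directions for frm, to in directions
        )


# A purchase order with a single order line between Distributor and Customer.
order = Contract("PurchaseOrder")
order.add_order_line("Chicken", "Money", seller="Dist", buyer="Cust")
```

Here the reciprocity check only compares directions of transfer; a fuller model would pair each Commitment explicitly with its counterpart.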




Fig. 1. REA basis for business models




A business model based on REA will consist of instances of the classes Resource, 
Economic Event and Agent as well as the associations between these. The business 
model for the running case described above can be visualized as in Fig. 2. Here, ar- 
rows represent Economic Events labeled with relevant Resources. The transfer of 
resource control from one Agent to another is represented by the direction of arrows. 
Ellipses represent relationships between Economic Events belonging to the same 
Duality. 

2.2 Process Models 

The notation we will use for process models is BPMN [4], a standard developed by 
the Business Process Management Initiative (BPMI) [3]. The goal of BPMN is to be an
easily comprehensible notation for a wide spectrum of stakeholders ranging from
business domain experts to technical developers. A feature of BPMN is that BPMN
specifications can be readily mapped to executable XML languages for process
specification such as BPEL4WS [2].

In this paper, a selected set of core elements from BPMN have been used. These 
elements are Activities, Events, Gateways, Sequence flows, Message flows, Pools and 
Lanes. Activity is a generic term for work that an Agent can perform. In a BPMN 
Business Process Diagram (abbreviated BPMN diagram), an Activity is represented by 
a rounded rectangle. Events, represented as circles, are something that “happens” 
during the course of a business process. There exist three types of Events: Start, End 
and Intermediate Events. Activities and Events are connected via Sequence Flows that 
show the order in which Activities will be performed in a process. Gateways are used 
to control the sequence flows by determining branching, forking, merging, and joining
of paths. In this paper we will restrict our attention to XOR and AND branching,
graphically depicted as a diamond with an ‘X’ or a ‘+’, respectively. Lanes and Pools
are graphical constructs for separating different sets of Activities from each other. A
Lane is a sub-partition within a Pool used to organize and categorize Activities.
Message flows, depicted as dotted lines, are used for communication between Activities
in different Pools. (An example appears later in Fig. 12.)

An example of a BPMN diagram is shown in Fig. 3. The diagram shows a single 
Business Transaction in one pool with three lanes. A Business Transaction is a unit of 
work through which information and signals are exchanged (in agreed format, se- 
quence and time interval) between two Agents [17]. A Business Transaction consists 
of two Activities, one Requesting Activity where one Agent initiates the Business 
Transaction and one Responding Activity where another Agent responds to the Re- 
questing Activity. (See Fig. 4) 

Fig. 3. Example of a BPMN diagram



Several Business Transactions between two Agents can be combined into one bi- 
nary Business Collaboration. It turns out that it is often fruitful to base binary Busi- 
ness Collaborations on Dualities, i.e. one Business Collaboration will contain all the 
Business Transactions related to one Duality. This gives a starting point for construct- 
ing a process model from a business model. Each Duality in the business model gives 
rise to one binary Business Collaboration, graphically depicted as a BPMN diagram in 
a Pool. In this way, a process model will be constructed as a set of interrelated Busi- 
ness Collaborations. 

Furthermore, a binary Business Collaboration can naturally be divided into a number
of phases. Dietz [6] distinguishes between three phases: the Ordering phase, in
which an Agent requests some Resource from another Agent who, in turn, promises to
fulfill the request; the Execution phase, in which the Agents perform Activities in
order to fulfill their promises; and the Result phase, in which an Agent declares a
transfer of Resource control to be finished, followed by the acceptance or rejection by
the other Agent. The ISO OPEN-EDI initiative [15] identifies five phases: Planning,
Identification, Negotiation, Actualization and Post-Actualization. In this paper, we use
only two phases: a Contract Negotiation phase in which contracts are proposed and
accepted, and an Execution phase in which transfers of Resources between Agents
occur and are acknowledged. In the next section, we will discuss how a binary Business
Collaboration can be constructed utilizing patterns for these phases.



3 Generic Process Patterns 

Designing and creating business and process models is a complicated and time-consuming
task, especially if one is to start from scratch for every new model. A good
design practice to overcome these difficulties is, therefore, to use already proven
solutions. A pattern is a description of a problem, its solution, when to apply the
solution, and when and how to apply the solution in new contexts [11]. The significance
of a pattern in e-commerce is to serve as a predefined template that encodes business
rules and business structure according to well-established best practices. In this paper
such patterns are expressed as BPMN diagrams. They differ from the workflow patterns
of [18], [16], [19] by focusing primarily on communicative aspects, while control
flow mechanisms are covered on a basic level only.

In the following subsections, a framework for analyzing and creating transaction
and collaboration patterns is proposed. We hypothesize that most process models for
e-commerce applications can be expressed as a combination of a small number of
these patterns.



3.1 Modeling Business Transactions 

When a transaction occurs, it typically gives rise to effects, i.e. Business Entities like
Economic Events, Contracts and Commitments are affected (created, deleted, cancelled,
fulfilled). Furthermore, the execution of a transaction may cause the desired effect to
come into existence immediately, or only indirectly, depending on the intentions of 
the interacting Agents. For example, the intention of an Agent in a transaction may be 
to propose a Contract, to request a Contract or to accept a Contract. In all three cases 
the business entity is the same (a Contract) but the intention of the Agent differs. 




Fig. 4. Business Transaction analysis 

Fig. 4 builds on REA and suggests a set of Business Intentions, Effects and Entities.
These notions are utilized in defining transaction patterns and transaction pattern
instances as follows.

Definition: A transaction pattern (TP) is a BPMN diagram with two Activities, one
Requesting Activity and one Responding Activity. Every Activity has a label of the
form <Intention, Effect, Business Entity>, where Intention ∈ {Request, Propose,
Declare, Accept, Reject, Acknowledge}, Effect ∈ {create, delete, cancel}, and
Business Entity ∈ {aContract, anEconomicEvent, aCommitment}. All End
Events are labeled according to the Intention and Business Entity of the Activity
prior to the sequence flow leading to the End Event.

Intuitively, the components of an activity label mean the following: 

• Business Entity tells what kind of object the Activity may effect. 

• Effect tells what kind of action is to be applied to the Business Entity - create, de- 
lete or cancel. 

• Intention specifies what intention the business partner has towards the Effect on the 
Business Entity. 

The meanings of the intentions listed above are as follows:

• Propose - someone offers to create, delete or cancel a Business Entity. 

• Request - someone requests other Agents to propose to create, delete or cancel a
Business Entity.

• Declare - someone unilaterally declares a Business Entity created, deleted or
cancelled.

• Accept/Reject - someone answers a previously given proposal. 

• Acknowledge - someone acknowledges the reception of a message. 

Definition: A pattern instance of a transaction pattern is a BPMN diagram derived 
from the pattern by renaming its Activities, replacing each occurrence of aContract in 
an activity label with the name of a specific Contract, replacing each occurrence of 
anEconomicEvent in an activity label with the name of a specific EconomicEvent, and 
replacing each occurrence of aCommitment in an activity label with the name of a 
specific Commitment. 
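
The two definitions above can be illustrated with a small sketch. The Python fragment below encodes activity labels and derives a pattern instance by replacing the placeholder Business Entity with a specific name. The class and function names are hypothetical, and the responding label of the Contract-Offer TP (`Acknowledge, create, aContract`) is an assumption, since only the requesting label is spelled out in the text.

```python
from dataclasses import dataclass, replace

INTENTIONS = {"Request", "Propose", "Declare", "Accept", "Reject", "Acknowledge"}
EFFECTS = {"create", "delete", "cancel"}


@dataclass(frozen=True)
class Label:
    """Activity label <Intention, Effect, Business Entity>."""
    intention: str
    effect: str
    entity: str  # aContract, anEconomicEvent, aCommitment, or a specific name

    def __post_init__(self):
        if self.intention not in INTENTIONS:
            raise ValueError(f"unknown intention: {self.intention}")
        if self.effect not in EFFECTS:
            raise ValueError(f"unknown effect: {self.effect}")


@dataclass(frozen=True)
class TransactionPattern:
    name: str
    requesting: Label
    responding: Label


def instantiate(pattern, instance_name, bindings):
    """Derive a pattern instance by replacing placeholder entities."""
    def bind(label):
        return replace(label, entity=bindings.get(label.entity, label.entity))
    return TransactionPattern(
        instance_name, bind(pattern.requesting), bind(pattern.responding)
    )


# Assumed labels for the Contract-Offer TP (responding label is a guess).
contract_offer = TransactionPattern(
    "Contract-Offer",
    requesting=Label("Propose", "create", "aContract"),
    responding=Label("Acknowledge", "create", "aContract"),
)
instance = instantiate(
    contract_offer, "Offer chicken sale", {"aContract": "ChickenSalesContract"}
)
```

The instance keeps the Intention and Effect of the pattern and differs only in its name and in the concrete Business Entity, which mirrors the renaming and replacement steps of the definition.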

3.2 Transaction Patterns (TPs) 

In the following sections three basic Contract Negotiation and two Execution TPs are 
suggested based on the framework described above. 

3.2.1 Contract Negotiation TPs 

The Contract-Offer TP models one Agent proposing an offer (<Propose, create,
aContract>) to another Agent who acknowledges receiving the offer. The acceptance or
rejection of an offer is modeled in the Contract-Accept/Reject TP, see Fig. 5.

Fig. 6. TP for Contract Negotiation: Contract-Request



Fig. 6 models the Contract-Request case where an Agent requests other Agents
to make an offer for aContract on certain Resources.

3.2.2 Execution TPs

We introduce two Execution TPs (see Fig. 7) that specify the execution of an Eco- 
nomic Event, i.e. the transfer of Resource control from one Agent to another. An ex- 
ample is a Chicken Distributor selling Chickens (a Resource) for $3 (another Re- 
source). 

Fig. 7. TPs for Execution: Economic Event Offer and Economic Event Accept



3.3 Assembling Transaction Patterns into Collaboration Patterns

An issue is how to combine the transaction patterns described in the previous section, 
i.e. how to create larger sequences of patterns. For this purpose, collaboration patterns 
define the orchestration of Activities by assembling a set of transaction patterns and/or 
more basic collaboration patterns based on rules for transitioning from one transac- 
tion/collaboration to another. 

To hide the complexity when TPs are combined into arbitrarily large collaboration 
patterns, we use a layered approach where the TPs constitute activities in the BPMN 
diagram of the collaboration patterns. 

Definition: A collaboration pattern (CP) is a BPMN diagram where the activities 
consist of transaction and collaboration pattern(s). A CP has exactly two end events 
representing success or failure of the collaboration, respectively. All end events are 
labeled according to the Intention and Business Entity of the Activity prior to the se- 
quence flow that led to the end event. 



3.3.1 Contract Negotiation CPs 

The Contract Establishment CP, see Fig. 8, is assembled from the Contract-Offer and 
Contract-Accept/Reject TPs. An example scenario is a Chicken Distributor proposing 
an offer to a customer on certain terms. The contract is formed (or rejected) by the
customer’s acceptance or rejection of the proposed offer.

Fig. 8. Contract Establishment CP



The two recursive paths when a contract offer/request has been rejected have a
natural correspondence in the business negotiation concepts ‘Counter Offer’ and
‘Bidding’ (or ‘Auctioning’) respectively. ‘Counter Offer’ refers to the switch of roles
between Agents, i.e. when the responding Agent has rejected the requesting Agent’s
offer, the former makes an offer of her own. ‘Bidding’ is modeled via the other
sequence flow from the gateway, i.e. when the responding Agent has turned down a
contract offer, the requesting Agent immediately initiates a new Business Transaction
with a new (changed) offer for a Contract.
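
The control flow described above can be approximated by a small simulation. This is a rough sketch, assuming that each round consists of a Contract-Offer TP followed by a Contract-Accept/Reject TP and that a rejection leads either to a counter offer, to a new bid, or to failure. The decision names (`accept`, `counter`, `rebid`, `quit`) and the function itself are hypothetical and not taken from Fig. 8.

```python
def run_contract_establishment(decisions, requester="Dist", responder="Cust"):
    """Replay a negotiation and return (outcome, trace of activities).

    decisions: one entry per round, chosen by the responding Agent:
      'accept'  - the offer is accepted (success end event)
      'counter' - the offer is rejected and roles switch (Counter Offer)
      'rebid'   - the offer is rejected and the requester offers again (Bidding)
      anything else - the offer is rejected and the collaboration fails
    """
    trace = []
    for decision in decisions:
        # Contract-Offer TP: propose and acknowledge.
        trace.append((requester, "Propose", responder))
        trace.append((responder, "Acknowledge", requester))
        # Contract-Accept/Reject TP.
        if decision == "accept":
            trace.append((responder, "Accept", requester))
            return "success", trace
        trace.append((responder, "Reject", requester))
        if decision == "counter":
            requester, responder = responder, requester
        elif decision != "rebid":
            return "failure", trace
    # No more rounds available: treat as failure.
    return "failure", trace


outcome, trace = run_contract_establishment(["counter", "accept"])
```

In this run the Customer rejects the Distributor's offer and makes a counter offer, which the Distributor then accepts, so the collaboration ends in the success end event.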

The Contract-Proposal collaboration pattern, Fig. 9 (see footnote 2), is assembled from
the Contract-Request TP and the Contract-Establishment CP defined above.




3.3.2 Execution CP 

The execution collaboration pattern specifies relevant TPs and rules for sequencing 
among these within the completion of an Economic Event. The pattern is assembled 
from the Offer-Event and Accept/Reject Event TPs. 

Fig. 10. Execution CP



4 Action Dependencies 

The process patterns introduced in the previous section provide a basis for a partial
ordering of the activities taking place in a business process, in particular the ordering
based on contract negotiation and execution. We will refer to the activities involved in
the different phases of a process as contract negotiation or execution activities,
respectively. However, the ordering derived from the process patterns only provides a
starting point for designing complete process models, i.e. it needs to be complemented
by additional interrelationships among the activities. These interrelationships should
have a clear business motivation, i.e. every interrelationship between two activities
should be explainable and motivated in business terms. We suggest formalizing this
idea of business motivation by introducing the notion of action dependencies. An
action dependency is a pair of actions (either economic events or activities), where the
second action for some reason is dependent on the first one. We identify the following
four kinds of action dependencies.

Flow dependencies. A flow dependency [12] is a relationship between two Economic
Events, which expresses that the Resources obtained by the first Economic Event are
required as input to the second Economic Event. An example is a retailer who has to
obtain a product from an importer before delivering it to a customer. Formally, a flow
dependency is a pair <A, B>, where A and B are Economic Events from different
Dualities.

2 When a CP is composed of other CPs, no lanes can be shown as the Requesting and
Responding Activities are already encapsulated.

Trust dependencies. A trust dependency is a relationship between two Economic 
Events within the same Duality, which expresses that the first Economic Event has to 
be carried out before the other one as a consequence of low trust between the Agents. 
Informally, a trust dependency states that one Agent wants to see the other Agent do 
her work before doing his own work. An example is a car dealer who requires a down 
payment from a customer before delivering a car. Formally, a trust dependency is a 
pair <A, B>, where A and B are Economic Events from the same Duality. 

Control dependencies. A control dependency is a relationship between an execution 
Activity and a contract negotiation Activity. A control dependency occurs when one 
Agent wants information about another Agent before establishing a Contract with that 
Agent. A typical example is a company making a credit check on a potential customer 
(i.e. an exchange of the Resources information and money in two directions). For- 
mally, a control dependency is a pair <A, B>, where A is an execution Activity and B 
is a contract negotiation Activity and where A and B belong to different Dualities. 

Negotiation dependencies. A negotiation dependency is a relationship between Activi- 
ties in the contract negotiation phase from different Dualities. A negotiation depend- 
ency expresses that an Agent is not prepared to establish a contract with another 
Agent before she has established another contract with a third Agent. One reason for 
this could be that an Agent wants to ensure that certain Resources can be procured 
before entering into a Contract where these Resources are required. Another reason 
could be that an Agent does not want to procure certain Resources before there is a 
Contract for an Economic Event where these Resources are required. Formally, a 
negotiation dependency is a pair <A, B>, where A and B are contract negotiation 
Activities in different Dualities. 
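
The four formal definitions can be read as a classification rule over pairs of actions. The sketch below is one possible encoding in Python; the `Action` class, its `kind` values and the function name are assumptions introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    name: str
    kind: str     # "event", "execution" or "negotiation" (hypothetical tags)
    duality: str


def classify_dependency(first: Action, second: Action) -> Optional[str]:
    """Return the kind of action dependency <first, second>, if any applies."""
    same_duality = first.duality == second.duality
    kinds = (first.kind, second.kind)
    if kinds == ("event", "event"):
        # Same Duality: trust dependency; different Dualities: flow dependency.
        return "trust" if same_duality else "flow"
    if kinds == ("execution", "negotiation") and not same_duality:
        return "control"
    if kinds == ("negotiation", "negotiation") and not same_duality:
        return "negotiation"
    return None


down_pay = Action("DownPayToDist", "event", "Chicken Sales")
to_cust = Action("ChickenToCust", "event", "Chicken Sales")
to_dist = Action("ChickenToDist", "event", "Chicken Purchase")
```

For the running case, a pair such as `<DownPayToDist, ChickenToCust>` lies within one Duality and is therefore a candidate trust dependency, while `<ChickenToDist, ChickenToCust>` crosses Dualities and is a candidate flow dependency. Whether a candidate actually holds is a business decision made by the designer.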



5 A Designers Assistant 

In this section, we will show how a process model can be designed based on process 
patterns and action dependencies. Designing a process model is not a trivial task but 
requires a large number of design decisions. In order to support a designer in this task, 
we propose an automated designers assistant that guides the designer through the task 
by means of a sequence of questions, divided into four steps, followed by a fifth step 
where the process model is generated based on the answers to questions in steps 1-4,
see Fig. 11.

Step 1. Information is gathered about the Agents involved in the business process,
the Resources exchanged between them, and the Economic Events through which
these Resources are exchanged. The result from this step is a business model.

Step 2. Information about the (partial) order between the Economic Events is
gathered. The result from this step is an ordering of the Activities in the
Execution phase of a process model.

Step 3. Information about existing negotiation dependencies is gathered. The result
from this step is an ordering of the Activities in the Negotiation phase.

Step 4. Inter-phase and inter-pool dependencies are established. The result from
this step is an ordering of Activities that crosses the Negotiation and Execution
phases.

Step 5. A set of production rules is applied to the results of the previous steps in
order to generate a process model.

Fig. 11. Steps of the Designers Assistant



5.1 Step 1 - Business Model 

In order to produce a business model the following four questions need to be an- 
swered. Answers according to the running case are given after every question. 

1. Who are the Agents? Answers: Customer (Cust), Distributor (Dist),
Chicken Supplier (Supp), Carrier (Carr)

2. What are the Resources? Answers: Money, Chicken, Delivery 

3. What are the Economic Events? Specify them by filling in the following table. 



Table 1. Answers to question 3

Name of Economic Event   Transferred Resource   From Agent   To Agent
DownPayToDist            DownPayment            Cust         Dist
FinalPayToDist           FinalPayment           Cust         Dist
ChickenToCust            Chicken                Dist         Cust
DownPayToSupp            DownPayment            Dist         Supp
FinalPayToSupp           FinalPayment           Dist         Supp
ChickenToDist            Chicken                Supp         Dist
DeliveryToDist           Delivery               Carr         Dist
PayToCarr                Payment                Dist         Carr

4. Group the Economic Events into Dualities by filling in the following table. 



Table 2. The answers to question 4

Economic Event    Duality
DownPayToDist     Chicken Sales
FinalPayToDist    Chicken Sales
ChickenToCust     Chicken Sales
DownPayToSupp     Chicken Purchase
FinalPayToSupp    Chicken Purchase
ChickenToDist     Chicken Purchase
DeliveryToDist    Chicken Delivery
PayToCarr         Chicken Delivery

The answers to these four questions provide sufficient information to produce the
business model shown in Fig. 2.
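
As an illustration of how the answers from Step 1 could be stored, the following Python sketch encodes Tables 1 and 2 and derives, for each Duality, the pair of Agents that take part in the corresponding binary Business Collaboration. The data layout and the function name are assumptions chosen for this sketch.

```python
from collections import defaultdict

# Table 1: (name, transferred resource, from Agent, to Agent).
EVENTS = [
    ("DownPayToDist", "DownPayment", "Cust", "Dist"),
    ("FinalPayToDist", "FinalPayment", "Cust", "Dist"),
    ("ChickenToCust", "Chicken", "Dist", "Cust"),
    ("DownPayToSupp", "DownPayment", "Dist", "Supp"),
    ("FinalPayToSupp", "FinalPayment", "Dist", "Supp"),
    ("ChickenToDist", "Chicken", "Supp", "Dist"),
    ("DeliveryToDist", "Delivery", "Carr", "Dist"),
    ("PayToCarr", "Payment", "Dist", "Carr"),
]

# Table 2: Economic Event -> Duality.
DUALITY_OF = {
    "DownPayToDist": "Chicken Sales",
    "FinalPayToDist": "Chicken Sales",
    "ChickenToCust": "Chicken Sales",
    "DownPayToSupp": "Chicken Purchase",
    "FinalPayToSupp": "Chicken Purchase",
    "ChickenToDist": "Chicken Purchase",
    "DeliveryToDist": "Chicken Delivery",
    "PayToCarr": "Chicken Delivery",
}


def binary_collaborations(events, duality_of):
    """Map each Duality to the set of Agents involved in its Economic Events."""
    agents = defaultdict(set)
    for name, _resource, from_agent, to_agent in events:
        agents[duality_of[name]].update((from_agent, to_agent))
    return dict(agents)


collaborations = binary_collaborations(EVENTS, DUALITY_OF)
```

Each Duality in the running case involves exactly two Agents, which is what allows it to give rise to one binary Business Collaboration in its own pool.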

5.2 Step 2 - Execution Phase Order 

Having identified the Economic Events, the designer is prompted to determine the 
dependency orders. In this step only flow and trust dependencies are considered. 

5. Specify Flow and Trust Dependencies by filling in the table below (where the row
and column headings are Economic Events identified in question 4). If an
EconomicEvent_i (in row i) precedes an EconomicEvent_j (in column j), put a ‘<’
symbol in the corresponding cell (cell <i,j>). The ‘<’ symbol is to be subscripted
with ‘f’ or ‘t’ depending on the type of dependency.



Table 3. Answers to question 5 in the assistant

The input from this step will be sufficient to roughly sketch the Execution phase in
the BPMN diagram. See the shaded area in Fig. 12, where a pool is created for every
Duality. The numerical notes in the table are used to refer to the resulting sequence
and message flows in the model. However, some dependencies are later overridden by
refined orders and are then removed from the final model.

5.3 Step 3 - Contract Negotiation Phase Order 

After having gathered sufficient information to produce the BPMN diagram for the 
Execution phase, the analysis continues for the Contract Negotiation phase. As there 
are two ways for initiating a binary Business Collaboration according to the suggested 
collaboration patterns in Section 3, it is first necessary to identify which of these pat- 
terns to use for each binary collaboration. 

6. For each binary Business Collaboration, ask whether

(a) a quotation already exists when the binary collaboration starts, or 

(b) the binary collaboration is started by a partner requesting a quotation. 

If the answer is (a) then the contract establishment collaboration pattern of Fig. 8 will 
be chosen. If the answer is (b) then the contract proposal collaboration pattern of 
Fig. 9 will be chosen. 

Below, the answers to question 6 for the running case are given in bold.

6.1 (a) Does a quotation already exist when the Cust-Dist collaboration starts, or 
(b) is the Cust-Dist collaboration started by a partner requesting a quotation? 

6.2 (a) Does a quotation already exist when the Dist-Supp collaboration starts, or 
(b) is the Dist-Supp collaboration started by a partner requesting a quotation?

6.3 (a) Does a quotation already exist when the Dist-Carr collaboration starts, or 
(b) is the Dist-Carr collaboration started by a partner requesting a quotation? 



Note that abbreviated Agent names are used here in naming the collaborations above. 
The answers from this question are used to derive the beginning of each binary col- 
laboration (see the white area in Fig. 12). We continue by identifying the negotiation 
and control dependencies. 

7. Specify the Control and Negotiation Dependencies by filling in the following table
(where the row and column headings are pattern instantiations identified in
questions 4 and 6). If an Activity (see footnote 3) in row i precedes an Activity in
column j, put a ‘<’ symbol, subscripted according to the type of dependency
(negotiation or control), in the corresponding cell (i.e. cell <i,j> in the table).

Due to space restrictions and since the running case does not contain any control 
dependencies, we depict contract negotiation Activities only in Table 4. 



Table 4. Answers to the running case for question 7 

Note that the relationships within a binary collaboration are given by the process
patterns and we have therefore crossed out the corresponding cells. The results from
this question will give input for ordering the activities of the Contract Negotiation
phase across the binary collaborations. The alphabetical notes in the table refer to
the resulting flows in the process model in Fig. 12.

5.4 Step 4 - Refined Order 

In the first three steps, the Agents and Economic Events were identified. Furthermore,
the activities in the Execution phase, as well as the activities in the Contract
Negotiation phase, were ordered within and between binary Business Collaborations. In
this step, we identify relationships that cross binary collaborations as well as
Execution and Contract Negotiation phases.

3 Formally the contents of the rows are TP instances (see Section 3.1), but for simplicity, we
have referred to each TP instance by its first Activity.

8. For each pair of Economic Events <EE_i, EE_j> (see Table 3) such that EE_i <_f EE_j:
Is it required to perform EE_i before making a contract acceptance for EE_j (i.e. a
Contract Establishment between the Agents in EE_j)?

The intuition behind this question is that an Agent may want to ensure that she has
definite access to certain Resources before she is prepared to enter into a Contract for
some product where these Resources are needed as input. It is possible to think about
this question as a strengthening of a flow dependency: we say not only that EE_j cannot
be performed before we have got the Resources from EE_i, but even that we are
not prepared to enter into a Contract for EE_j before we have got the Resources from
EE_i.

Below, the implementation of question 8 for the running case is given with an- 
swers. 

8.1 Must DownPayToDist be done before establishment of Dist-Supp Contract? Yes

8.2 Must FinalPayToDist be done before establishment of Dist-Supp Contract? No

8.3 Must FinalPayToDist be done before establishment of Dist-Carr Contract? No

8.4 Must ChickenToDist be done before establishment of Dist-Cust Contract? No 

8.5 Must DeliveryToDist be done before establishment of Dist-Cust Contract? No 

9. For each triple of Economic Events in Table 3, EE_i, EE_j, EE_k, such that
EE_i <_t EE_j and EE_k <_f EE_j: Is it required to perform EE_i before making a
contract acceptance for EE_k (i.e. a Contract Establishment between the Agents in
EE_k)?

This question can be seen as a strengthening of a trust dependency. It says not only
that we want to see another Agent perform EE_i before we perform EE_j, but that we
want to see our partner perform EE_i before we even start acquiring resources
needed to perform EE_j.

Below, the implementation of question 9 for the running case is given with answers.

9.1 Must DeliveryToDist be done before establishment of Cust-Dist Contract? No

9.2 Must ChickenToDist be done before establishment of Cust-Dist Contract? No

9.3 Must DownPayToDist be done before establishment of Dist-Supp Contract (see footnote 4)? No

9.4 Must DownPayToDist be done before establishment of Dist-Carr Contract? Yes
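
The way questions 8 and 9 are instantiated from the dependency table can be sketched as follows. Since the body of Table 3 is not reproduced here, the flow and trust pairs below are assumptions inferred from the question lists 8.1-8.5 and 9.1-9.4 and may not coincide with the full table; the function names are hypothetical as well.

```python
from itertools import product

# Duality of each Economic Event, following Table 2.
DUALITY = {
    "DownPayToDist": "Chicken Sales",
    "FinalPayToDist": "Chicken Sales",
    "ChickenToCust": "Chicken Sales",
    "DownPayToSupp": "Chicken Purchase",
    "FinalPayToSupp": "Chicken Purchase",
    "ChickenToDist": "Chicken Purchase",
    "DeliveryToDist": "Chicken Delivery",
    "PayToCarr": "Chicken Delivery",
}

# ASSUMPTION: pairs (first, second) inferred from the question lists, not Table 3.
FLOW = [
    ("DownPayToDist", "DownPayToSupp"),
    ("FinalPayToDist", "FinalPayToSupp"),
    ("FinalPayToDist", "PayToCarr"),
    ("ChickenToDist", "ChickenToCust"),
    ("DeliveryToDist", "ChickenToCust"),
]
TRUST = [
    ("DeliveryToDist", "PayToCarr"),
    ("ChickenToDist", "FinalPayToSupp"),
    ("DownPayToDist", "ChickenToCust"),
]


def question_8(flow, duality):
    """For EE_i <_f EE_j: must EE_i precede the contract of EE_j's Duality?"""
    return [(ee_i, duality[ee_j]) for ee_i, ee_j in flow]


def question_9(trust, flow, duality):
    """For EE_i <_t EE_j and EE_k <_f EE_j: must EE_i precede EE_k's contract?"""
    return [
        (ee_i, duality[ee_k])
        for (ee_i, ee_j), (ee_k, flow_target) in product(trust, flow)
        if ee_j == flow_target
    ]


q8 = question_8(FLOW, DUALITY)
q9 = question_9(TRUST, FLOW, DUALITY)
```

Each resulting pair reads as "must this Economic Event be done before establishment of the Contract for this Duality?", where Chicken Sales, Chicken Purchase and Chicken Delivery correspond to the Cust-Dist, Dist-Supp and Dist-Carr Contracts, respectively.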

5.5 Business Process Generation 

The final step of the proposed designers assistant is the generation of a BPMN dia- 
gram based on the answers from steps 1-4. This is achieved using the binary col- 
laboration patterns introduced in Section 3 and a set of production rules to intercon- 
nect those instantiated binary collaborations into a multi-party collaboration. The set 
of production rules that are proposed can be categorized into four groups: Rules for 
Binary Collaborations (within a pool), Rules for Inter-Collaborations (between 
pools), Reduction Rules, and Deadlock Prevention Rules. However, due to space 
limitations, only the first two categories are summarized informally here. 



4 Case 9.3 is already covered by case 8.1 and only shown here for reasons of completeness.

Rules for Binary Collaborations

1. For each Duality, introduce a binary collaboration contained in one pool. Such a
collaboration will start with an instantiation of the Contract Proposal CP (Fig. 9) if
the answer to question 6 is (b); otherwise the binary collaboration starts with an
instantiation of the Contract Establishment CP (Fig. 8).

2. The BPMN diagram in a pool continues with instantiations of the Fig. 10 pattern, 
one for each Economic Event identified through question 5 in the designers assis- 
tant. These collaborations will initially be in parallel. 

3. For each trust dependency between two Economic Events, introduce a sequence 
flow between the corresponding execution activities. 

Rules for Inter-Collaborations

1. For each flow dependency between two Economic Events, introduce a message 
flow between the corresponding execution activities. 

2. For each control and negotiation dependency between two activities, introduce a 
message flow between these activities. 

3. For each positive answer to questions 8 and 9, introduce a message flow between 
the relevant activities. 

The BPMN diagram generated by these rules for the running case is shown in Fig. 12.
(A formal definition of the generation via the production rules is found in Chapter 7
of [9].)
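
The informal rules above can be approximated in a few lines of code. The sketch below is not the formal generation procedure of [9]; it only illustrates the first two rule groups under simplifying assumptions. The function name, the dictionary layout and the answers to question 6 used in the example are hypothetical.

```python
def generate_process_model(dualities, q6_answers, trust, flow, extra=()):
    """Apply the rules for binary collaborations and inter-collaborations.

    dualities:  Duality name -> list of its Economic Events
    q6_answers: Duality name -> "a" (quotation exists) or "b" (quotation requested)
    trust/flow: lists of (first event, second event) dependencies
    extra:      pairs from control and negotiation dependencies and from
                positive answers to questions 8 and 9
    """
    pools = {}
    for duality, events in dualities.items():
        # Rule 1: choose the starting CP from the answer to question 6.
        start = (
            "Contract Establishment CP"
            if q6_answers[duality] == "a"
            else "Contract Proposal CP"
        )
        # Rule 2: one Execution CP instance per Economic Event.
        pools[duality] = [start] + [f"Execution CP: {event}" for event in events]
    return {
        "pools": pools,
        # Rule 3: trust dependencies become sequence flows within a pool.
        "sequence_flows": list(trust),
        # Inter-collaboration rules: these become message flows between pools.
        "message_flows": list(flow) + list(extra),
    }


model = generate_process_model(
    dualities={
        "Chicken Sales": ["DownPayToDist", "ChickenToCust"],
        "Chicken Purchase": ["ChickenToDist"],
    },
    q6_answers={"Chicken Sales": "b", "Chicken Purchase": "a"},  # assumed answers
    trust=[("DownPayToDist", "ChickenToCust")],
    flow=[("ChickenToDist", "ChickenToCust")],
)
```

The reduction and deadlock prevention rules are left out here, so the result may contain redundant flows that the complete rule set would remove.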

6 Conclusions and Further Work 

In this paper, we have proposed an approach for building process models on a de- 
clarative foundation. A starting point of the approach is that the process modeling task 
can be supported by gradually gathering domain knowledge, initially for the construc- 
tion of business models and subsequently for their refinement and transformation into 
process models. The proposed designers assistant is structured on a division of e- 
commerce interactions into two phases: a Contract Negotiation Phase where a con- 
tract for exchanging economic resources is established; and an Execution Phase where 
the actual exchanges of the economic resources take place. We believe that this phase 
division provides an adequate starting point, but a topic for further work is to investi- 
gate more refined phase divisions, [6], [15]. 

The proposed approach is based on the concept of process patterns. A framework 
for representing process patterns is introduced, together with a number of basic proc- 
ess patterns. Two kinds of process patterns are identified: transaction patterns, basi- 
cally capturing small communication chunks between two agents; and collaboration 
patterns, which are compositions of transaction patterns facilitating the representation 
of complex interactions. The value of this framework is not only that it provides an 
instrument for precise and unambiguous pattern definitions, but also that it gives a 
basis for motivating design choices in process modeling. 

Finally, we also introduce the notion of action dependencies for capturing relation- 
ships between the activities within a process. Four kinds of dependencies are identi- 
fied: flow, trust, control and negotiation dependencies. They can be stated declara- 
tively, have a clear business motivation, and are used for the final derivation of a 
process model. A topic for further work is to investigate whether additional kinds of 
action dependencies are required. 



Fig. 12. Final BPMN diagram for Fried Chicken Case

A further line of future work is to examine the quality of the produced models,
during which their completeness as well as their logical soundness should be
investigated. While the work on completeness can primarily be done through empirical
studies, the work on logical soundness can be supported by theoretical work like the
one given in [5].

Abstract. A sequence diagram in UML is used to model interactions among
objects that participate in a use case. Developing a sequence diagram is com- 
plex; our experience shows that novice developers have significant difficulty. In 
earlier work, we presented a ten-step heuristic method for developing sequence 
diagrams. This paper presents a tabular analysis method (TAM) which im- 
proves on the ten-step heuristic method. TAM analyzes the message require- 
ments of the use case, while documenting the resulting analysis in a tabular 
format. The resulting table is referenced to build the sequence diagram. This 
process aids novice modelers by separating the problem analysis from the learn- 
ing curve of a modeling tool. Building sequence diagrams with the systematic 
approach of TAM facilitates consistency with the use case model and the class 
model. We found that developers effectively developed sequence diagrams us- 
ing TAM. 



1 Introduction 

A sequence diagram is a type of interaction diagram, which is used in UML to depict 
a set of messages between objects which participate in a use case [2, 13]. Objects in a 
sequence diagram are typically instances of classes, and the messages passed between 
objects invoke operations of the receiving classes [12]. If we accept as axiomatic that 
the elements of a sequence diagram should be consistent with their corresponding 
elements in the other diagrams of the system model, then we need straightforward 
construction methods which help the modeler achieve this consistency. 

While seemingly intuitive, methods for constructing a sequence diagram have not 
been discussed much in the literature. Our experience shows that novice developers have 
significant trouble in understanding and developing sequence diagrams. Most UML 
books simply explain the notations and semantics and present pre-built sequence 
diagrams. Some authors provide simple guidelines for developing sequence diagrams. 
We found that those simple guidelines are not sufficient for many novice developers. 
Most research activities on sequence diagrams have focused on real time systems [6, 
17], simulation [4] or behavior-driven analysis and design. Very few authors have even 
mentioned possible methods, processes or steps that could be used to develop effective 
sequence diagrams. 

Li [7] proposed using a parser to semi-automatically translate use case steps into 
“message records” which can be used to construct a sequence diagram. The parser 
produces a tabular listing of classes, objects, and operations, based on the syntactic 
structure of each sentence in the use case steps. The modeler can then apply this in- 
formation to create the sequence diagram. Li’s method relies on first “normalizing” 
the expression of the use case steps to a somewhat rigorous grammatical model. This 
normalization depends on a modeler’s English-language skills. Furthermore, although 
normalization of the vocabulary and expressions of use cases theoretically may be 
beneficial, it may not always be feasible in real-world projects because the use case is 
primarily for communication with users, not for input to a computer program. 



1  Select the initiating actor and initiating event from the use case description.
2  Identify the primary display screen needed for implementing the use case. Call it the primary
   boundary object.
3  Create a use-case controller (primary control object) to handle communication between the primary
   boundary object and domain objects.
4  If the use case involves any included or extended use case, create one secondary control object for
   each of them.
5  Identify the number of major screens necessary to implement the use case. Create one secondary
   boundary object for each of the major screens and create one secondary control object for each of
   them.
6  From the class diagram, list all domain classes participating in the use case by reviewing the use case
   description. If any class identified from the use case description does not exist in the class diagram,
   add it to the class diagram.
7  Use those classes just identified as block labels (column names) in the sequence diagram. List classes
   in the following order: (1) the primary boundary stereotype, (2) the primary use case controller,
   (3) domain classes (list in the order of access), and (4) secondary control objects and secondary
   boundary objects in the order of access.
8  Identify all problem-solving operations based on the following classifications:
   8.1 Instance creation and destruction
   8.2 Association forming
   8.3 Attribute modification:
       8.3.1 Calculation
       8.3.2 Change states
       8.3.3 Display or reporting requirements
       8.3.4 Interface with external objects or systems
   These problem-solving operations can be identified as follows:
   - Identify verbs from the use case description.
   - Remove verbs used to describe the problem; select verbs used to solve the problem. We call these
     verbs problem-solving verbs (PSVs).
   - From the problem-solving verbs, select verbs that represent an automatic operation. We call these
     PSVs problem-solving operations (PSOs) and use them in the sequence diagram.
9  Rearrange the sequence of messages among the object classes based on any pre-existing design
   patterns, when possible.
10 Name each message and supply it with optional parameters. This can be done at design stage as well.

Fig. 1. The Ten-Step heuristics [18] 

Song [18] introduced a heuristic-based approach (Figure 1) to constructing se- 
quence diagrams. The technique instructs the modeler to pull appropriate elements 
from the prerequisite model artifacts (the use case description and the analysis class 
diagram), and induces some consistency in the resulting model. Our paper proposes 
an enhancement to the heuristic approach of [18]. Similarly to [7], our proposed 
method results in a tabular listing of message records. However, while [7] requires 
“normalization” of the use case steps, our method proposed here requires only that the 
use case be clear and unambiguous, and that ultimately agreement between the use 
case and the sequence diagram can be confirmed. 

The enhanced method, referred to as the tabular analysis method (TAM), consists 
essentially of reordering the steps of Song’s ten-step heuristic (referred to as the 
“original heuristic”), sequentially applying the reordered steps to each action in the 
use case activity flow, and documenting the resulting analysis in a tabular format. The 
tabular data can then be used as a reference to build the diagram relatively quickly in 
a modeling tool. Some advantages of the tabular method over the original heuristic 
are: 

• The step-by-step process of the TAM is defined in more detail, which should be 
helpful to novice modelers to conceptualize and visualize the building process. 

• With the TAM, a more comprehensive model is relatively easy to construct: the 
tabular format presents the modeler with all elements to be considered for an op- 
eration in an easy-to-read format, which encourages thoroughness in modeling pa- 
rameters, constraints, etc. - resulting in a more semantically complete model with 
minimal added effort. 

• Tool-independence: the tabular data could be exported to an XMI file (see [10]) 
and then uploaded into the modeling tool of choice, to create the model elements. 
A further conversion of the XMI file to an SVG file (see [11]) could fully auto- 
mate the diagram creation from the table. While the theorized automation capabil- 
ity has not yet been developed, if this potential is realized, the tabular method pre- 
sented here will have significant added value. 

By using the method proposed in this paper, the modeler can express the analysis 
in a more commonly familiar tool (a word processor or spreadsheet program), and 
then reference the analysis worksheet while learning how to construct the elements in 
the CASE tool. In other words, this method separates two analytic processes so that 
each can be more fully attended to by students or novice modelers. 
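
Because the analysis is kept in an ordinary table, it can be stored in any format a spreadsheet program reads. The following is a minimal sketch, assuming the table is saved as CSV with one column per heading of the analysis table; the function name `sat_to_csv`, the `COLUMNS` list, and the dictionary-per-row layout are illustrative assumptions, and this is not the XMI export envisioned above.

```python
import csv
import io

# Assumed column names, following the headings of the analysis table in this paper.
COLUMNS = ["Step", "Use case action", "Message Name",
           "Parameters", "Constraints", "Sender", "Receiver"]


def sat_to_csv(rows):
    """Serialize analysis-table rows (a list of dicts) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # columns missing from a row become empty cells
    return buffer.getvalue()


example_rows = [
    {"Step": "2", "Use case action": "Staff verifies the customer's status.",
     "Message Name": "enter_customer", "Sender": "Staff", "Receiver": "BO"},
]
csv_text = sat_to_csv(example_rows)
```

A file written this way can be opened in a spreadsheet program and edited there while the diagram is built separately in the modeling tool.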

The rest of this paper is organized as follows. Section 2 discusses the consistency 
among UML elements that must be maintained for accurate sequence diagrams. 
Section 3 presents our TAM for constructing sequence diagrams 
with examples, and Section 4 concludes the paper. 



2 Model Consistency 

Sequence diagrams share model elements with use cases and class diagrams. In this 
section, we discuss consistency issues among use case models, class diagrams, and 
sequence diagrams. 

2.1 Consistency with the Use Case Model 

A sequence diagram represents the design for fulfilling the requirements expressed in 
a use case [5]. The use case elements which should be reflected in the sequence dia- 
gram are postconditions, actions, and related use cases (included, extending, and spe- 
cialized). The postconditions in the use case description specify what state of the 
system must be true upon successful completion of the use case [5]. If the sequence 
diagram depicts all the behavior required for successful completion of the use case, it 
follows that each postcondition specified in the use case description must be achieved 
by some message in the set of sequence diagrams for that use case. Conversely, if the 
use case postconditions accurately define the system state, it follows that the use case 
description should identify as postconditions all final states resulting from execution 
of the use case behavior detailed by the sequence diagram. 

Each action specified or implied in the use case description should be detailed in a 
corresponding message or set of messages in the sequence diagram. Depending on the 
clarity and completeness of the use case description text, the author of the sequence 
diagram may need to infer some of the operations. Song [18] presents several catego- 
ries for identifying operations: instance creation and destruction, association forming, 
attribute modification, calculation, change of states, display or reporting, and interface 
with external objects or systems. Each action in the use case description will require 
one or more of these message types for its fulfillment. 
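
The requirement that every use case action be detailed by at least one message can be checked mechanically once steps and messages are numbered. The sketch below is only an illustration under stated assumptions: steps are identified by their numbers, and a message label such as "2.1" is taken to mean a message that details step 2. The function name `uncovered_steps` is hypothetical.

```python
def uncovered_steps(use_case_steps, message_rows):
    """Return the use case steps that no message row details.

    use_case_steps: step numbers from the use case description, e.g. ["1", "2"].
    message_rows: (step_label, message_name) pairs; a label such as "2.1"
        is assumed to mark a message added while detailing step 2.
    """
    covered = {label.split(".")[0] for label, _ in message_rows}
    return [step for step in use_case_steps if step not in covered]


rows = [("1", "start_rental"), ("1.1", "start_rental"), ("2", "enter_customer")]
missing = uncovered_steps(["1", "2", "3"], rows)  # step 3 has no message yet
```

An empty result does not prove the messages are the right ones; it only shows that no action was left without any message at all.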

2.2 Consistency with the Class Diagram 

From the literature review, we have identified several areas where consistency be- 
tween class diagrams and sequence diagrams can be easily confirmed: classes, opera- 
tions, arguments, visibility between objects, and composition responsibility. 

Classes. All entity classes used in a sequence diagram must appear in the Class Dia- 
gram. Conversely, if sequence diagrams are completed for all use cases within the 
project scope, all entity classes shown on the design class diagram must be used in at 
least one sequence diagram, with the exception of some or all abstract classes. Ab- 
stract classes may be shown in certain cases [9, 15]; but generally the receiver of a 
message is a concrete class - the lowest class in its hierarchy to which all instances 
addressed by the message could belong [15]. 

Operations. For an object to handle a message that it receives, it must have a con- 
forming interface, which is defined in the receiver’s class as an operation signature 
[12, 8]. Therefore, all messages shown on the sequence diagram must map to opera- 
tions of the receiving class in the class diagram. A temporary message name may be 
assigned before the class operation has been designed. 
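
The rule that each message must map to an operation of the receiving class lends itself to a simple lookup. The following sketch assumes the class diagram is available as a dictionary from class name to a set of operation names; that representation, the function name `unmapped_messages`, and the sample operations are all hypothetical.

```python
def unmapped_messages(messages, class_operations):
    """List messages whose name is not an operation of the receiver's class.

    messages: (message_name, receiver_class) pairs from a sequence diagram.
    class_operations: dict mapping class name -> set of operation names.
    """
    return [
        (name, receiver)
        for name, receiver in messages
        if name not in class_operations.get(receiver, set())
    ]


# Hypothetical class diagram fragment; the operation sets are invented for illustration.
operations = {"Customer": {"get_cust"}, "Rental": {"create_rental"}}
messages = [("get_cust", "Customer"), ("calc_total", "Rental")]
problems = unmapped_messages(messages, operations)
```

Messages reported here are either temporary names awaiting a designed operation or genuine inconsistencies between the two diagrams.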

Arguments. A sequence diagram message may transfer information to the receiver as 
arguments. Arguments must represent information that is known to the sender, such 
as attribute values or constants. Depending on the intended precision of the model, the 
sequence diagram may not show all the relevant arguments [3]. However, some pa- 
rameters should always be shown, such as an object or parameter that is being passed 
among multiple other objects [3]. Some practitioners choose not to show all (or even 
any) return messages [12]. Pender argues that it is worth the effort to model operations 
and returns completely, to avoid ambiguity [12]. 

Visibility (Relationships Between Classes). In order for objects to exchange messages, 
the sending object must have a handle to the receiving object [15]. Another way of 
saying this is that the sender must have visibility to the receiver. Some authors state or 
imply that a message between two objects in a sequence diagram requires a perma- 
nent association (association, generalization, or aggregation) to be shown between the 
classes in the class diagram [1]. Others note that there are four types of visibility possible 
between objects - attribute visibility, parameter visibility, local visibility, and 
global visibility [5, 14] - and that only attribute visibility requires a permanent asso- 
ciation [16]. Messages which rely on parameter, local, or global visibility to a class 
require a temporary, or transient, association between the classes [16]. A transient 
association is modeled on the class diagram as a dependency instead of an association, 
with an arrow depicting the direction of the dependency [5, 14]. To summarize, consistency 
between the class diagram and the sequence diagram requires that for each 
message in the sequence diagram, the class diagram depicts either a permanent association 
or a dependency, according to the type of visibility required, between the 
classes of the sender and receiver. Conversely, if an association depicted on the class 
diagram is never used in an interaction, then there must be an error in the model [1]. 

Composition Responsibility. If the class diagram depicts a whole-part (composition, 
or strict aggregation) relationship between two classes, then the whole should create 
the part [5, 14]. In the sequence diagram, this is depicted by first creating the composite 
(probably with a create() message from a controller), then using the composite as 
the sender for the create() message to the part. 
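
The visibility rule summarized above (every message needs an association or a dependency between the classes involved, and every association should be used by some message) can be sketched as a two-way check. The code is an illustration under assumptions: associations are treated as usable in both directions, dependencies as directed from client to supplier, and the class name Tape and the sample relationships are hypothetical.

```python
def check_links(messages, associations, dependencies):
    """Compare sequence-diagram messages with class-diagram relationships.

    messages: (sender_class, receiver_class) pairs.
    associations: class pairs joined by a permanent association (treated as
        usable in both directions, which is a simplifying assumption).
    dependencies: (client, supplier) pairs modeling transient visibility.
    Returns (messages lacking any link, associations never used).
    """
    assoc = {frozenset(pair) for pair in associations}
    deps = set(dependencies)
    missing = []
    used = set()
    for sender, receiver in messages:
        if sender == receiver:
            continue  # a self-message needs no relationship
        pair = frozenset((sender, receiver))
        if pair in assoc:
            used.add(pair)
        elif (sender, receiver) not in deps:
            missing.append((sender, receiver))
    unused = sorted(tuple(sorted(pair)) for pair in assoc - used)
    return missing, unused


missing, unused = check_links(
    messages=[("RentalHandler", "Rental"), ("Rental", "Customer"),
              ("RentalHandler", "Tape")],
    associations=[("Rental", "Customer"), ("Rental", "Tape")],
    dependencies=[("RentalHandler", "Rental")],
)
```

The unused list is only meaningful once sequence diagrams exist for all use cases in scope, since an association may be exercised by a diagram that has not been drawn yet.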

3 Constructing Accurate Sequence Diagrams with TAM 

In this section, we present our tabular analysis method (TAM) for constructing accu- 
rate sequence diagrams in a manner that enforces consistency with the use case, while 
promoting consistency with the class diagram as well. The TAM uses the ten-step 
heuristic introduced in [18] as a starting point, and applies it methodically to each step 
in the use case description. The TAM adopts the following procedures from [18]: 

• A system sequence diagram (SSD) is constructed first, treating the system as a 
“black box” and modeling only the actions visible to the actor. These actions are 
called system events [5]. 

• Each system event may be documented in one detailed sequence diagram; or de- 
tails for multiple system events may be combined in one sequence diagram. 

• A separate sequence diagram will be constructed for each included or extending 
use case. 

• A separate controller is used for the base use case, and each included or extending 
use case, and functions as a “connector” between the sequence diagrams. 

• Actors communicate with boundary objects, which communicate with controllers, 
which communicate with the entity objects. Normally, actors do not communicate 
directly with controller or entity objects. 
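
The last rule in the list above, that actors talk to boundary objects, boundary objects to controllers, and controllers to entity objects, can be expressed as a table of permitted stereotype pairs. This is a minimal sketch: the stereotype labels and the function name `layering_violations` are assumptions, and allowing controller-to-controller and entity-to-entity messages is an added assumption rather than something the rule states.

```python
# Unordered stereotype pairs allowed to exchange messages. The first three come
# from the rule above; control-control and entity-entity are added assumptions.
ALLOWED = {
    frozenset(["actor", "boundary"]),
    frozenset(["boundary", "control"]),
    frozenset(["control", "entity"]),
    frozenset(["control"]),
    frozenset(["entity"]),
}


def layering_violations(messages, stereotype_of):
    """Return messages whose sender and receiver should not talk directly."""
    violations = []
    for sender, receiver in messages:
        if sender == receiver:
            continue  # self-messages (e.g. calculations) are not checked
        pair = frozenset([stereotype_of[sender], stereotype_of[receiver]])
        if pair not in ALLOWED:
            violations.append((sender, receiver))
    return violations


stereotypes = {"Staff": "actor", "RentalWindow": "boundary",
               "RentalHandler": "control", "Customer": "entity"}
bad = layering_violations(
    [("Staff", "RentalWindow"), ("RentalWindow", "RentalHandler"),
     ("Staff", "Customer")],
    stereotypes,
)
```

Since the rule is qualified with "normally", a reported pair is best read as a prompt for review and not as a definite error.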

The TAM uses a tabular format called the Sequence Analysis Table (SAT), which 
captures the list of use case actions and adds columns for source and receiving objects, 
message names and parameters. Figure 2 shows a condensed version of the empty 
template. The table used here can also be thought of as a condensed version of the 
Tabular Notation described in the UML 2.0 specification [13]. A Sequence Analysis 
Table is created for the primary use case and each included or extending use case. The 
overall process of the TAM can be summarized as follows: 

• From the use case description, create a system sequence analysis table (SSAT), from 
which the system sequence diagram is created. Note that each line here is an input 
system event from an actor to the system or an output from the system to an actor. 

• Expand each line of the SSAT into a detailed sequence analysis table, so that each 
system event or output is broken into multiple messages that can be represented in 
a sequence diagram. 

• Each included or extending use case description results in a separate detailed 
SAT. 

• Create a sequence diagram from the detailed sequence analysis table. 

Step | Use case action | Message Name | Parameters | Constraints | Sender | Receiver
-----+-----------------+--------------+------------+-------------+--------+---------
     |                 |              |            |             |        |

Fig. 2. The Template for Sequence Analysis Table (SAT) 
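
For readers who prefer to keep the table in a program instead of a word processor, one row of the template can be modeled as a small record. The sketch below is an assumption-laden illustration: the class name `SatRow`, its field names, and the `signature` helper are hypothetical and simply mirror the seven column headings of the template.

```python
from dataclasses import dataclass


@dataclass
class SatRow:
    """One line of a Sequence Analysis Table (field names are assumed)."""
    step: str
    action: str = ""
    message: str = ""
    parameters: str = ""
    constraint: str = ""
    sender: str = ""
    receiver: str = ""

    def signature(self):
        """Render the row the way a message label might appear on a diagram."""
        guard = f"[{self.constraint}] " if self.constraint else ""
        return f"{guard}{self.message}({self.parameters})"


row = SatRow(step="2.1", message="get_cust", parameters="cust_ID",
             sender="RentalWindow", receiver="RentalHandler")
label = row.signature()
```

Leaving every field except the step number optional matches the way the table is filled in gradually: actions first, then message names, then senders, receivers and parameters.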



Actor Action | System Action
-------------+--------------
1. Staff starts “New Rental” mode on POS, if POS is not already in the correct mode. |
2. The staff verifies the customer’s status by their ID card or number. | 3. System finds and displays the customer’s information.
 | 4. INCLUDE Get Overdue Fees. Any overdue items are displayed with tape information, due date, and late fee amount due.
5. The staff records new rental items by scanning the bar codes. | 6. System determines rental price and due date and displays with title, date and price for each item.
7. On completion of last item entry, the staff indicates to the POS terminal that the rental is complete. | 8. System calculates tax and total rental fee, and records with date out. Late fees determined in prior step are added to the balance due.
9. Staff accepts payment from customer for the balance due. | 10. INCLUDE Pay Fees.
 | 11. IF payment is successful, on-hand inventory is reduced by one for each rented item, and receipt is printed.
12. Staff hands items and receipt to customer and concludes transaction. |

Fig. 3. The Main Success Scenario of Use Case “Process Rents” of VRS Use Case Description 



3.1 Getting Started - Designing the System Sequence Diagram (SSD) 

In this section, we discuss how to create a system sequence diagram in the TAM. The 
first step is to copy the actions from the use case description document (UCD) to the 
empty template. Each input from an actor to the system (called a system event) and an 
output from the system to the actor forms a row in SAT. Next, for each row, enter a 
short message name that describes the primary communication. Depending on the 
quality of the UCD steps, some editing may be required at this point. 

The next step is to identify sending and receiving objects as the initiating actor and 
a boundary object representing the system user interface. Evaluate each subsequent 
action as “from the actor to the system” or “from the system to the actor”. For the 
“system”, put “BO” (meaning a boundary object that represents the system being modeled) 
in the “Sender” or “Receiver” column. It is not necessary to name boundary objects 
yet. 

An example Use Case Description of the use case “Process Rents” for a Video 
Rental System (VRS) case study is shown in Figure 3 (only the Main Success Sce- 
nario is shown). The resulting SSD table is shown in Figure 4; and the resulting SSD 
diagram is shown in Figure 5. Refer to [18] for the problem statement, the use case 
diagram, and the class diagram of the VRS case study. 
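
The procedure of this section (copy each action into a row, name the message, and set the sender and receiver to the actor or to “BO”) can be sketched in a few lines. This is an illustration under assumptions: the function name `build_ssat`, the tuple layout of the input, and the "actor"/"system" tags are hypothetical, and the message names are supplied by the modeler, not derived automatically.

```python
def build_ssat(steps, actor):
    """Create system-level table rows from use case steps.

    steps: (number, performer, action_text, message_name) tuples, where
        performer is "actor" or "system" (an assumed tagging).
    actor: name of the initiating actor.
    """
    rows = []
    for number, performer, action, message in steps:
        if performer == "actor":
            sender, receiver = actor, "BO"   # system event: actor -> system
        else:
            sender, receiver = "BO", actor   # output: system -> actor
        rows.append({"step": number, "action": action, "message": message,
                     "sender": sender, "receiver": receiver})
    return rows


ssat = build_ssat(
    [("2", "actor", "Staff verifies the customer's status by their ID card or number.",
      "enter_customer"),
     ("3", "system", "System finds and displays the customer's information.",
      "display_customer")],
    actor="Staff",
)
```

Actions that take place entirely outside the system, such as the customer handing payment to the staff, do not fit either branch and would have to be entered by hand.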

S# | Use case action | Message Name | Parameters | Constraint | Sender | Receiver
---+-----------------+--------------+------------+------------+--------+---------
1 | Staff starts “New Rental” mode on POS, if POS is not already in the correct mode. | start_rental | | | Staff | BO
2 | Staff verifies the customer’s status by their ID card or number. | enter_customer | | | Staff | BO
3 | System finds and displays the customer’s information. | display_customer | | | BO | Staff
4 | INCLUDE Get Overdue Fees. Any overdue items are displayed with tape information, due date, and late fee amount due. | display_overdue | | | BO | Staff
5 | The staff records new rental items by scanning the bar codes. | enter_rental_items | | *[until end_rental received] | Staff | BO
6 | System determines rental price and due date and displays with title, date and price for each item. | display_item_data | | [enter_rental_items received] | BO | Staff
7 | On completion of last item entry, the staff indicates to the POS terminal that the rental is complete. | end_rental | | | Staff | BO
8 | System calculates tax and total rental fee, and records with date out. Late fees determined in prior step are added to the balance due. | display_total_due | | | BO | Staff
9 | Staff accepts payment from customer for the balance due. | n/a | | | Customer | Staff
10 | INCLUDE Pay Fees. | enter_payment | | | Actor: Staff | BO
11 | IF payment is successful, on-hand inventory is reduced by one for each rented item, and receipt is printed. | print_receipt | | payment is successful | BO | Staff
12 | Staff hands items and receipt to customer and concludes transaction. | n/a | | payment is successful | Staff | Customer

Fig. 4. Resulting System Sequence Analysis Table (SSAT) for System Sequence Diagram of 
the VRS Example 



3.2 Defining the Sequence Diagram (SQD) Details 

In this section, we show how to build the detailed sequence diagram in the TAM. At 
this point, the template contains a single line for each system event between the Actor 
and the System, as shown in Figure 4. Based on the ten-step heuristic, this section 
shows how to decompose the single interaction into multiple messages at the detailed 
level, as shown in Figure 6. 

A detailed sequence diagram must show the interactions within the system, be- 
tween various objects, as well as the parameters and constraints relevant to the mes- 
sages. The following steps describe a methodical approach to completing the table. In 
the resulting table, there will be one line for each message shown on the sequence 
diagram. That means, in completing the detailed information, it will be necessary to 
insert lines in the table wherever multiple messages are required to implement a use 
case step - which will be true for almost all of the use case steps. The steps described 
below are illustrated in Figure 6 through Figure 8 for the “Enter Customer” system 
event of the VRS use case. 

[Figure: system sequence diagram with two lifelines, the Staff actor and :VRS System. Messages shown, in order: enter_customer, display_customer, display_overdue, *enter_rental_items [until end_rental received], display_item_data, end_rental, display_total_due, enter_payment, print_receipt.]

Fig. 5. Resulting System Sequence Diagram (SSD) for use case “Process Rents” 



Step | Use case action | Message Name | Parameters | Constraints | Sender | Receiver
-----+-----------------+--------------+------------+-------------+--------+---------
1 | Staff starts “New Rental” mode on POS, if POS is not already in the correct mode. | start_rental | | | Actor: Staff | RentalWindow
1.1 | | start_rental | | | RentalWindow | RentalHandler
1.2 | | request_cust_id | | | RentalHandler | RentalWindow
2 | Staff verifies the customer’s status by their ID card or number. | enter_customer | cust_ID | | Actor: Staff | RentalWindow
2.1 | | get_cust | cust_ID | | RentalWindow | RentalHandler
2.2 | | get_cust | cust_ID | | RentalHandler | Customer
2.3 | | | customer_data | | Customer | RentalHandler
2.4 | | | customer_data | | RentalHandler | RentalWindow
3 | System finds and displays the customer’s information. | display_customer | cust_ID | | RentalWindow | Actor: Staff

Fig. 6. Expansion of use case step 2 of Figure 4 for communication with controllers 

Specify details for initial lines 

1. Name the Primary Boundary Object, and replace all instances of “BO” in the table 
with the selected name. 

• In VRS example: RentalWindow. 

• This is analogous to step 2 of the original heuristic. 

2. Add a use case controller (CO) to the Receiver column for each Included or 
Extending Use Case. The detailed steps for each included or extending use case 
will be documented in a separate Sequence Analysis Table. 

• In VRS example, the controllers are named: for “INCLUDE Get Overdue 
Fees” - AccountHandler; for “INCLUDE Pay Fees” - PaymentHandler. 

• This is analogous to step 4 of the original heuristic. 

3. Identify constraints: everywhere a qualifying word such as “if” appears in a use 
case step, there should be a constraint for one or more messages. 

• Initially, we suggest addressing only the “Main Success Scenario” of the use 
case. However, to fully complete the set of sequence diagrams for the use case, 
the modeler must verify that all alternatives are documented. For example, in 
the VRS case there is a qualifier “If payment is successful” in the main success 
scenario. The alternative path - payment is not successful - is addressed in the 
“Other Successful Scenarios” and “Unsuccessful Scenarios” of the use case de- 
scription. Initially, and for this paper, the alternative paths will not be devel- 
oped. 

Expand the table by adding lines for required messages - decompose each use case 
action as follows: 

4. Add the lines for communication from the primary boundary object to the 
appropriate controller and back. This incorporates step 3 of the original heuristic. 

5. Identify the problem-solving operations as described in [18] Heuristic step 8. 

6. Identify the message parameters (data) and the classes to which the corresponding 
attributes belong. This step corresponds to step 6 of the original heuristic. 

• If the sequence diagram is intended to be completed exhaustively and precisely, 
all parameters should be determined, by consulting the analysis class diagram 
for entity classes and attributes which satisfy the semantics of the use case step. 

• If a less exhaustive documentation is sufficient, specify only the most important 
parameters. Alternatively, a group of attributes may be named, as in “cus- 
tomer_data” rather than listing customer name, address, etcetera. Always iden- 
tify all entity classes which will be involved in each use case step. 

• If the need for additional entity classes is discovered, add the new classes to the 
class diagram, with the required attributes. 

• If the need for additional attributes for an entity class is discovered, add the 
new attributes to the class in the class diagram. 

7. For each entity class identified in step 6, insert a line in the table, between the 
lines just added for communication with controllers. Add the object representing 
the entity class in the Receiver column, using instance notation (e.g., :Customer ) 
if an instance is appropriate (almost always the case - if in doubt, assume an in- 
stance is required). The receivers should be listed in the order they will be ad- 
dressed. 

Step # | Use case action | Message Name | Parameters | Constraints | Sender | Receiver
-------+-----------------+--------------+------------+-------------+--------+---------
1 | Staff starts “New Rental” mode on POS, if POS is not already in the correct mode. | start_rental | | | :Staff | :RentalWindow
1.1 | | start_rental | | | :RentalWindow | :RentalHandler
 | | create_rental | | | :RentalHandler | :Rental
1.2 | | request_cust_id | | | :RentalHandler | :RentalWindow
2 | Staff verifies the customer’s status by their ID card or number. | enter_customer | cust_ID | | :Staff | :RentalWindow
2.1 | | get_cust | cust_ID | | :RentalWindow | :RentalHandler
2.2 | | get_cust | cust_ID | | :RentalHandler | :Rental
2.3 | | get_cust | cust_ID | | :Rental | :Customer
2.4 | | | customer_data | | :Customer | :Rental
2.5 | | | customer_data | | :Rental | :RentalHandler
2.6 | | | customer_data | | :RentalHandler | :RentalWindow
3 | System finds and displays the customer’s information. | | | | |
4 | INCLUDE Get Overdue Fees. Any overdue items are displayed with tape information, due date, and late fee amount due. | get_overdue | cust_ID | | :RentalWindow | :AccountHandler
4.1 | | display_overdue | tapeInfo, dueDate, lateFeeDue | | :AccountHandler | :RentalWindow
4.2 | | display_customer | customer_data, overdue_data | | :RentalWindow | :Staff

Fig. 7. Detailed Sequence Analysis Table for Enter Customer system event 



8. For messages which are procedure calls, the Sender for each new line is typically 
the Receiver of the prior line. Thus the Sender in the first inserted line is the con- 
troller of the use case step. However, this rule will not always result in the correct 
allocation of responsibility. See [5] for discussion of allocating responsibility to 
objects; in proceeding through the following steps, the modeler should re-arrange 
the Senders and Receivers as necessary to assign responsibilities correctly. 

9. Name the messages and distribute the parameters among the lines pertaining to the 
step. This step corresponds to step 10 of the original heuristic. 

• Some parameters to be displayed must be calculated. Insert a line for each 
calculation; identify the attributes needed for the calculation and list them as 
parameters. 

• Verify that every parameter to be displayed is either retrieved from a class or 
calculated based on attributes retrieved from a class, and that all of the classes 
involved are listed. 

[Figure: sequence diagram with lifelines Staff, RentalWindow, RentalHandler, Customer and AccountHandler. Messages shown include: start_rental(), ask_custID, enter_customer(cust_ID), get_cust(cust_ID) passed along the chain of objects, returns of (customer_data), display_customer, get_overdue(cust_ID), and a return of (tapeInfo, dueDate, lateFeeDue).]

Fig. 8. Sequence Diagram for “Enter Customer” system event 



• Indicate iteration of a message with a * (example: *get_item_info). 

• Be sure to identify and add to the table, any messages with the same Sender and 
Receiver (such as calculations). 

• Figure 7 shows how the steps to this point have been applied for the first two 
use case steps of the table. The expansion of use case step 2 to add communica- 
tion with controllers and the entity class Customer is outlined in bold. 

10. Review the constraints originally entered for the use case steps, and copy these as 
necessary into the new lines for the added messages. 
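
Steps 4 through 8 above amount to inserting, for one system event, a chain of procedure calls in which each sender is the receiver of the previous line, followed by the returns in reverse order. The sketch below reproduces that pattern; the function name `expand_step` and its argument layout are hypothetical, and it covers only the simple chain case, not the re-allocation of responsibility discussed in step 8.

```python
def expand_step(step, boundary, chain, call, params, result):
    """Expand one system event into numbered call and return rows.

    boundary: object that received the actor's message.
    chain: objects addressed in order, starting with the controller.
    call, params: message name and parameters passed down the chain.
    result: data returned back up the chain.
    Each new sender is the receiver of the previous row.
    """
    hops = [boundary] + list(chain)
    rows = []
    # Forward procedure calls, one per hop.
    for sender, receiver in zip(hops, hops[1:]):
        rows.append([call, params, sender, receiver])
    # Returns travel the same path in reverse and carry no message name.
    back = hops[::-1]
    for sender, receiver in zip(back, back[1:]):
        rows.append(["", result, sender, receiver])
    # Number the inserted rows step.1, step.2, ...
    return [(f"{step}.{i}", *row) for i, row in enumerate(rows, start=1)]


detail = expand_step("2", ":RentalWindow",
                     [":RentalHandler", ":Rental", ":Customer"],
                     "get_cust", "cust_ID", "customer_data")
```

With these inputs the function yields six rows numbered 2.1 through 2.6, with the same senders, receivers and parameters as the corresponding rows of Figure 7.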

Throughout the analysis, identify any clarifications needed, or missing or implied 
steps in the use case. Insert rows and add steps as needed; highlight changes to the use 
case steps in order to go back and update the UCD for consistency later. If it is appar- 
ent that additional major screens will be required, create a secondary boundary object 
and secondary controller for each major screen that is needed in addition to the main 
screen for the use case. Insert lines in the table as needed for passing messages from 
the primary controller to each secondary boundary object. 

At this point, the table is complete for the main success scenario of the primary use 
case. Similar tables should be constructed for the included and extending use cases. If 
a generic sequence diagram that includes all scenarios is desired, messages 
and qualifications can be added as necessary by inserting rows in the table, to 
include the information covered in the “Other Successful Scenarios” and “Unsuccessful 
Scenarios” sections of the UCD. 

Once satisfied that the table represents all data required for the sequence diagram, 
the modeler can create the diagram by referring to the table. Since the classes have 
already been modeled, the modeler merely drags the required classes into the se- 
quence diagram. The classes should be ordered in the diagram as proposed in step 7 
of the original heuristic. Then the message names, parameters, and qualifications can 
be copied and pasted from the table into operations in the modeling tool. The 
messages should be shown on the diagram in the correct relative time order. As men- 
tioned in the Introduction, we envision a future capability to export a table developed 
with this method into an XMI file which can then be imported into any XMI- 
compatible tool, thus further simplifying the diagram creation. 
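
The table-to-diagram step described above can be illustrated with a minimal sketch. It assumes that each table row carries a sender, a receiver, a message name, parameters, an iteration flag and a constraint; the field names, the `render` helper and the participant names (`Clerk`, `MainScreen`, `MainController`, `Item`) are hypothetical and are not taken from the paper's table. Only the `*` iteration marker and the `get_item_info` example come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MessageRow:
    """One row of the analysis table (field names are illustrative assumptions)."""
    sender: str
    receiver: str
    message: str
    parameters: str = ""
    iterated: bool = False   # iterated messages are shown with a leading "*"
    constraint: str = ""


def render(rows: List[MessageRow], class_order: List[str]) -> List[str]:
    """Render the rows, assumed to be in time order, as plain-text message lines."""
    lines = ["participants: " + ", ".join(class_order)]
    for number, row in enumerate(rows, start=1):
        name = ("*" if row.iterated else "") + row.message
        guard = f" [{row.constraint}]" if row.constraint else ""
        lines.append(
            f"{number}. {row.sender} -> {row.receiver}: {name}({row.parameters}){guard}"
        )
    return lines


rows = [
    MessageRow("Clerk", "MainScreen", "enter_customer", "customer_id"),
    MessageRow("MainScreen", "MainController", "enter_customer", "customer_id"),
    MessageRow("MainController", "Item", "get_item_info", "item_id", iterated=True),
]
diagram = render(rows, ["Clerk", "MainScreen", "MainController", "Item"])
```

Because the rows are plain data, the same list could in principle feed an export routine instead of `render`; the XMI export envisioned in the text is not sketched here.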



4 Conclusion 

This paper has proposed the tabular analysis method (TAM) for constructing accurate 
sequence diagrams. The proposed method is rigorous and, applied as envisioned, 
results in a thorough modeling of operation elements. As such it may be considered 
tedious, but has the advantage of separating model analysis from the vagaries of tool 
usage. Therefore, it may prove ideal for application in the following circumstances: 

• learning environments; 

• situations where there is a need for comprehensive sequence diagrams; and 

• situations where there may be multiple modeling tools in use, or tool selection is 
not complete at the time modeling needs to begin. 

In addition, the tabular format used in this method is anticipated to be adaptable to 
automated model interchange in accordance with the OMG specifications for XML 
Metadata Interchange [10] and UML 2.0 Diagram Interchange [11]. Successful de- 
velopment of conversion scripts to realize this automation will enhance the applicabil- 
ity of the tabular approach, to the point where even experienced modelers may find it 
useful for quickly documenting interaction sequences. 

Abstract. UML is a language for specifying, visualizing and documenting 
object-oriented systems. However, UML statecharts lack precisely defined syntax 
and semantics. This paper provides a method of formalizing the semantics of UML 
statecharts with Z. According to this precise semantics, UML statecharts are 
transformed into FREE (Flattened Regular Expression) state models. The 
hierarchical and concurrent structure of states is flattened in the resulting FREE 
state model. The model helps to determine whether the software design is 
consistent, unambiguous and complete. It is also beneficial to software testing. 



1 Introduction 

The Unified Modeling Language (UML) is a de facto standard for documenting the 
specification and design of object-oriented systems. The continuously growing 
popularity of this notation has led software developers to use UML to model application 
domains that were originally outside the scope of the language. These domains include 
business processes, Web-based applications, information systems, component-based 
systems, etc. In general, the rich set of diagrams provided by UML, together with a 
flexible extension mechanism, allows developers to model all the relevant features of 
software systems [1]. UML’s main advantage is its great variety of intuitive 
and mostly well-known notations for the different kinds of information to be specified: 
requirements, static structure, interactive and dynamic behavior, as well as physical 
implementation structures. However, this intuitive appeal comes at the price of an 
insufficient definition. Whereas the UML syntax is defined in quite a precise and 
complete manner, its semantics is not. 

Statechart diagrams were originally introduced by David Harel [4] in the mid-1980s. 
The notation and semantics of UML statecharts were adapted from Harel’s original 
version with the addition of object-oriented features [1]. The UML statechart diagram 
is an important part of the standard UML language [1]. UML statecharts extend 
ordinary state transition diagrams with notions of hierarchy and concurrency [1]. 
They are a visual language and are typically used to model the dynamic behaviour of 
a class of UML objects. This language has proved useful in modeling complex control 
aspects of many software systems. The UML statechart diagram is a highly expressive 
hierarchical modelling language with well-defined syntax [1]. Unfortunately, its 
precise semantics is not well formalized. 

However, there are some semantic differences between UML statecharts and David 
Harel’s statecharts. The major difference is that the causal paradoxes avoided in 
classical statecharts by introducing the notion of steps [3,4,5,7,10,11] are not an 
issue under the run-to-completion assumption in UML, which allows an event to be 
dispatched only when the processing of the previous event is fully completed (refer 
to [1] for other semantic differences relative to classical statecharts). For these 
reasons, although several semantics have been proposed for classical statecharts 
[3,4,5,7,10,11], it is worthwhile to define a formal semantics for UML statecharts. 
A formalization of the state machine package in the UML metamodel using Object-Z was 
presented in [2]. However, [2] gave no formal syntax and semantics of whole UML 
statecharts. [8] described a formal semantics of a special class of UML statecharts 
rather than of general UML statecharts, and it did not consider the concurrency 
mechanism. [9] described the syntax of UML statecharts with the Graphtype Definition 
Language and specified the semantics of UML statecharts by an Abstract State Machine 
in a heterogeneous modeling environment, but did not describe a formal semantics of 
UML statecharts in a general modeling environment. In [12], starting with a precise 
textual syntax definition, the authors developed a structured operational semantics 
for UML statecharts based on labeled transition systems. In [13], an operational 
semantics for a subset of UML state machines was proposed. [14, 15] gave a formal 
syntax and semantics of UML statecharts using mathematical methods. 

In this paper, we use Z notation rather than “standard mathematics” to formalize 
UML statecharts. Z has been used for precisely describing users’ requirements, and has 
been applied to a number of digital systems in a variety of ways to improve the 
specification of computer-based systems [16]. Many textbooks on Z are now available 
[17], and the teaching of Z has become of increasing interest [18]. The Z notation is 
supported by many tools and is widely used. 

We present the formal definition of the syntax and semantics of UML statecharts, 
and extend the definition of firing priorities between two conflicting compound 
transitions based on [1]. In this paper, transition labels are restricted so that the 
only effect of actions is the generation of events. 

It is difficult to generate test cases for a class directly from UML statechart 
diagrams that contain hierarchical and concurrent structure. According to our precise 
semantics, a UML statechart diagram can be transformed into a FREE state model [6]. 
For example, the UML statechart diagram shown in Fig. 1 can be transformed into the 
FREE state model shown in Fig. 2. The hierarchical and concurrent structure of states 
is flattened in the resulting FREE state model. The model helps to determine whether 
the software design is consistent, unambiguous and complete. A UML model that 
follows the FREE conventions will be testable [6], which is beneficial to software testing. 

In Section 2, the formal semantics of UML statecharts is defined. Section 3 gives 
the FREE state model. Finally, some conclusions are drawn. 

2 Formal Semantics of UML Statecharts 

2.1 Well-Formed UML Statecharts 

A set of states at different levels forms a state hierarchy: the states contained within a 
state are called substates of the surrounding state; the surrounding state is called the 
composite state and is higher in the hierarchy than the states it contains. The highest 
state, which is not surrounded by any state, is called the top state. In the example 
shown in Fig. 1, S0 is the top state in the state hierarchy, and S1 and S2 are its 
substates. S0 is an ancestor of S1 and S2. 




Fig. 1. An example of a UML statechart diagram 

State Names and State Types. We postulate a finite, nonempty set of state names $\Sigma$ 
and denote the types of states by TYPE. We denote a non-concurrent composite state by 
NCCS, a concurrent composite state by CCS, the initial state by INITIAL and the final 
state by FINAL. 

\[
\begin{array}{l}
[\Sigma] \\
\exists\, n : \mathbb{N} \bullet \#\Sigma = n \\
TYPE ::= SIMPLE \mid NCCS \mid CCS \mid INITIAL \mid FINAL
\end{array}
\]

Definition of State Hierarchy. A state hierarchy STATETREE consists of the following 
components: the root top of the tree, the finite hierarchy function $\rho$, which assigns 
a (possibly empty) set of direct substates to an ancestor state, and the finite typing 
function $\psi$, which assigns to each state its type. For example, the direct substates of 
state S0 are initial, S1 and S2, namely $\rho(S0) = \{initial, S1, S2\}$, and 
$\psi(S0) = NCCS$, $\psi(S2) = SIMPLE$, $\psi(S3) = CCS$. The schema STATETREE, which 
defines these objects, is given on the next page. 

Simple Transition. A simple transition connects two states. Simple transition labels 
in UML statecharts have the simple structure 

event [guard] / action 

where event is the trigger event, guard is a boolean condition, and action is an action. 
All three parameters are optional. 

We define a finite set of events. 

\[
\begin{array}{l}
[EVE] \\
\exists\, n : \mathbb{N} \bullet \#EVE = n
\end{array}
\]

As mentioned above we restrict the action part to the generation of events only. 
The schema LABEL defines the set of labels. 




A simple transition leads from a state denoted by source to another state denoted 
by target. Simple transitions are labeled. The schema TRANSITION defines the set of 
simple transitions. 




Now we can give the condition under which a state hierarchy and a set of simple 
transitions compose into a well-defined statechart. 

Well-Formed Statecharts. The consistency between the root, the initial state, the 
final state and the set of simple transitions is as follows: 

1. A simple transition connects a source state to a target state. The source and target 
can be composite states. 

2. The top state is neither a target nor a source of any simple transition. 

3. An initial state has exactly one outgoing simple transition (called initial transition 
and indicated as initialt) and no incoming simple transitions [1]. 

4. A final state has at least one incoming simple transition (called final transition and 
indicated as finalt) and no outgoing simple transitions. 

These requirements are formalized in the schema STATECHARTS. 

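\[
\begin{array}{l}
\textbf{STATECHARTS} \\
\hline
tree : STATETREE \\
tset : \mathbb{F}_1\, TRANSITION \\
\hline
\forall t : tset \bullet \#\{t.source\} = 1 \land \#\{t.target\} = 1 \\
\quad \land\; tree.top \notin \{t.source, t.target\} \\
\forall s : \Sigma \bullet ((tree.\psi(s) = INITIAL \Rightarrow (s \notin \{t : tset \bullet t.target\}) \\
\quad \land\; (\exists_1\, initialt : tset \bullet initialt.source = s)) \\
\quad \land\; (tree.\psi(s) = FINAL \Rightarrow (s \notin \{t : tset \bullet t.source\}) \land (\exists\, finalt : tset \bullet finalt.target = s)))
\end{array}
\]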



2.2 State Configurations 

When dealing with composite states, the simple term “current state” can be quite 
confusing. In UML statecharts more than one state can be active at once. The currently 
active “state” is actually represented by a tree of states starting with the single top 
state at the root down to individual simple states at the leaves. We refer to such a state 
tree as a state configuration, such as {S0, S1, S4} or {S0, S1, S3, S5, S6, S7, S9}. Except 
during transition execution, only one state configuration is active, and the following 
invariants always apply to state configurations [1]: 

• If a non-concurrent composite state is active, exactly one of its substates is active. 

• If a concurrent composite state is active, all of its substates (regions) are active. 

We introduce functions to reason about state hierarchy and we identify a relation 
between a set of states and an ancestor state. This relation helps us to define the state 
configuration function confset. 

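Substates and Ancestors. We define extensions of $\rho$: its reflexive, transitive closure 
$\rho^*$ and its non-reflexive, transitive closure $\rho^+$. The ancestor relation is 
expressed by ancestor. The strict ancestor relation is expressed by strancestor. 

\[
\begin{array}{l}
\rho^*, \rho^+ : \Sigma \times STATETREE \rightarrow \mathbb{F}\,\Sigma \\
ancestor\,\_\,,\ strancestor\,\_ : \mathbb{P}(\Sigma \times \mathbb{F}_1\Sigma \times STATETREE) \\
\hline
\forall s : \Sigma;\ tree : STATETREE \bullet \\
\quad \rho^*(s, tree) = \{s\} \cup \bigcup \{sub : tree.\rho(s) \bullet \rho^*(sub, tree)\} \\
\quad \land\; \rho^+(s, tree) = \rho^*(s, tree) \setminus \{s\} \\
\forall anc : \Sigma;\ set : \mathbb{F}_1\Sigma;\ tree : STATETREE \bullet \\
\quad (ancestor(anc, set, tree) \Leftrightarrow set \subseteq \rho^*(anc, tree)) \\
\quad \land\; (strancestor(anc, set, tree) \Leftrightarrow set \subseteq \rho^+(anc, tree))
\end{array}
\]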
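
As an informal illustration of these closures, the following sketch computes them over a small state tree. The tree in `RHO` is an assumption: it is reconstructed from the configurations quoted in the running example (Fig. 1 itself is not reproduced here), and pseudostates are left out. The function names mirror the schema but are otherwise hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hierarchy function (direct substates). Assumed shape of the running example,
# inferred from the configurations listed in the text; pseudostates omitted.
RHO: Dict[str, List[str]] = {
    "S0": ["S1", "S2"], "S1": ["S3", "S4"], "S3": ["S5", "S6"],
    "S5": ["S7", "S8"], "S6": ["S9", "S10"],
    "S2": [], "S4": [], "S7": [], "S8": [], "S9": [], "S10": [],
}


def rho_star(s: str) -> FrozenSet[str]:
    """Reflexive, transitive closure: s together with all of its descendants."""
    result = {s}
    for sub in RHO[s]:
        result |= rho_star(sub)
    return frozenset(result)


def rho_plus(s: str) -> FrozenSet[str]:
    """Non-reflexive, transitive closure: the descendants of s without s itself."""
    return rho_star(s) - {s}


def ancestor(anc: str, states: Iterable[str]) -> bool:
    """True if every state in `states` lies in rho_star(anc)."""
    return set(states) <= rho_star(anc)


def strancestor(anc: str, states: Iterable[str]) -> bool:
    """True if every state in `states` lies in rho_plus(anc)."""
    return set(states) <= rho_plus(anc)
```

With this tree, `ancestor("S1", {"S1"})` holds while `strancestor("S1", {"S1"})` does not, which is exactly the difference between the two closures.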



State Configurations. Given a STATETREE tree and a state s that is neither an initial 
state nor a final state, let CONF be the set of state configurations containing s. Every 
configuration in CONF is a set of states conf obeying the following rules: 

• conf contains tree.top. 

• If conf contains a state st that is not tree.top, it must also contain the ancestors of st. 

• If conf contains a state st of type NCCS, it must also contain exactly one of st’s 
substates. 

• If conf contains a state st of type CCS, it must also contain all of st’s substates. 

The only states in conf are those required by the above rules. Obviously, a 
configuration must contain at least one simple state. Generally, a configuration must 
not contain a pseudostate or a final state. The following definition is a direct 
interpretation of the rules above, where CONF is the set of configurations that 
contain s. 



\[
\begin{array}{l}
confset : \Sigma \times STATETREE \rightarrow \mathbb{F}_1\,\mathbb{F}_1\,\Sigma \\
\hline
\mathrm{dom}(confset) = \{ s : \Sigma;\ tree : STATETREE \mid s \neq tree.top \\
\quad \land\; tree.\psi(s) \neq INITIAL \land tree.\psi(s) \neq FINAL \} \\
\forall s : \Sigma;\ tree : STATETREE;\ CONF : \mathbb{F}_1\mathbb{F}_1\Sigma \bullet confset(s, tree) = CONF \Leftrightarrow \\
\quad \forall conf : CONF \bullet tree.top \in conf \land s \in conf \\
\quad \land\; (\forall st : conf \bullet (tree.\psi(st) = NCCS \Rightarrow (\exists_1\, sub : tree.\rho(st) \bullet sub \in conf)) \\
\quad \land\; (tree.\psi(st) = CCS \Rightarrow tree.\rho(st) \subseteq conf) \\
\quad \land\; ((st \neq tree.top \land \forall anc : \Sigma \bullet ancestor(anc, \{st\}, tree)) \Rightarrow anc \in conf))
\end{array}
\]

For example, in a STATETREE tree, the state S2 maps into the set of state 
configurations {{S0, S2}} and the state S3 maps into the set of state configurations 
{{S0, S1, S3, S5, S6, S7, S9}, {S0, S1, S3, S5, S6, S7, S10}, {S0, S1, S3, S5, S6, S8, S9}, 
{S0, S1, S3, S5, S6, S8, S10}} by applying the function confset. We will define a special 
configuration $conf_0$ that contains the initial state later. 
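
The configuration rules can be checked mechanically. The sketch below enumerates configurations top-down and then filters those that contain a given state. It is an illustration under stated assumptions, not the authors' procedure: the tree (`RHO`), the typing (`PSI`, including the NCCS type of the regions S5 and S6) and the helper `configs_below` are reconstructed or invented for this example, and pseudostates are omitted.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Set

# Assumed shape of the running example (inferred from the text, not from Fig. 1).
RHO: Dict[str, List[str]] = {
    "S0": ["S1", "S2"], "S1": ["S3", "S4"], "S3": ["S5", "S6"],
    "S5": ["S7", "S8"], "S6": ["S9", "S10"],
    "S2": [], "S4": [], "S7": [], "S8": [], "S9": [], "S10": [],
}
PSI: Dict[str, str] = {s: "SIMPLE" for s in RHO}
PSI.update({"S0": "NCCS", "S1": "NCCS", "S3": "CCS", "S5": "NCCS", "S6": "NCCS"})


def configs_below(s: str) -> List[FrozenSet[str]]:
    """All ways of completing state s downwards according to the rules."""
    if PSI[s] == "SIMPLE":
        return [frozenset({s})]
    if PSI[s] == "NCCS":
        # exactly one direct substate is active
        return [c | {s} for sub in RHO[s] for c in configs_below(sub)]
    # CCS: every region is active, so combine one completion per region
    per_region = [configs_below(sub) for sub in RHO[s]]
    return [frozenset({s}).union(*combo) for combo in product(*per_region)]


def confset(s: str, top: str = "S0") -> Set[FrozenSet[str]]:
    """Configurations of the whole tree that contain state s."""
    return {c for c in configs_below(top) if s in c}
```

Under these assumptions `confset("S2")` yields the single configuration {S0, S2} and `confset("S3")` yields the four configurations listed above.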

We now introduce a function that helps us to calculate the target configuration of 
a compound transition. 

Default Configuration. The default configuration of a state st, denoted by the function 
defaconf(st, tree), is defined as a configuration conf containing st such that, for every 
NCCS s of the state configuration conf, if the state s is not a strict ancestor of the 
state st, then the default substate of s is also in conf. A default substate of an NCCS 
is the arriving state of the transition initialt in the NCCS. According to the definition 
of state configuration, defaconf(st, tree) yields exactly one state configuration. 

\[
\begin{array}{l}
defaconf : \Sigma \times STATETREE \rightarrow \mathbb{F}_1\,\Sigma \\
\hline
\mathrm{dom}(defaconf) = \{ s : \Sigma;\ tree : STATETREE \mid s \neq tree.top \\
\quad \land\; tree.\psi(s) \neq INITIAL \land tree.\psi(s) \neq FINAL \} \\
\forall st : \Sigma;\ tree : STATETREE;\ conf : \mathbb{F}_1\Sigma;\ CONF : \mathbb{F}_1\mathbb{F}_1\Sigma \bullet \\
\quad defaconf(st, tree) = conf \Leftrightarrow \exists_1\, conf : CONF \bullet CONF = confset(st, tree) \\
\quad \land\; \forall s : conf \bullet ((tree.\psi(s) = NCCS \land \lnot\, strancestor(s, \{st\}, tree)) \\
\qquad \Rightarrow (\exists_1\, defa : tree.\rho(s) \bullet initialt.target = defa \\
\qquad\quad \land\; defa \in conf))
\end{array}
\]

For example, in a STATETREE tree, the state S1 maps into the state configuration 
{S0, S1, S3, S5, S6, S7, S9} by applying the function defaconf, the state S3 maps into the 
state configuration {S0, S1, S3, S5, S6, S7, S9}, and the state S8 maps into the state 
configuration {S0, S1, S3, S5, S6, S8, S9}. 
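
A constructive reading of the default configuration is sketched below. It enters the state and everything beneath it by default, then walks upwards and enters by default every sibling region of a concurrent ancestor. The tree, the typing and the default substates in `DEFAULT` (S3 for S1, S7 for S5, S9 for S6) are assumptions inferred from the three examples above; `parent_of` and `enter_default` are hypothetical helpers.

```python
from typing import Dict, FrozenSet, List, Optional, Set

# Assumed shape of the running example (inferred from the text, not from Fig. 1).
RHO: Dict[str, List[str]] = {
    "S0": ["S1", "S2"], "S1": ["S3", "S4"], "S3": ["S5", "S6"],
    "S5": ["S7", "S8"], "S6": ["S9", "S10"],
    "S2": [], "S4": [], "S7": [], "S8": [], "S9": [], "S10": [],
}
PSI: Dict[str, str] = {s: "SIMPLE" for s in RHO}
PSI.update({"S0": "NCCS", "S1": "NCCS", "S3": "CCS", "S5": "NCCS", "S6": "NCCS"})
# Target of the initial transition of each NCCS (assumption; S0 is never needed
# because the top state is outside the domain of defaconf).
DEFAULT: Dict[str, str] = {"S1": "S3", "S5": "S7", "S6": "S9"}


def parent_of(s: str) -> Optional[str]:
    """Direct ancestor of s, or None for the top state."""
    for parent, subs in RHO.items():
        if s in subs:
            return parent
    return None


def enter_default(s: str, conf: Set[str]) -> None:
    """Add s and, recursively, its default substates (all regions for a CCS)."""
    conf.add(s)
    if PSI[s] == "NCCS":
        enter_default(DEFAULT[s], conf)
    elif PSI[s] == "CCS":
        for region in RHO[s]:
            enter_default(region, conf)


def defaconf(st: str) -> FrozenSet[str]:
    """Default configuration containing st (st must not be the top state)."""
    conf: Set[str] = set()
    enter_default(st, conf)
    child, cur = st, parent_of(st)
    while cur is not None:
        conf.add(cur)
        if PSI[cur] == "CCS":
            for region in RHO[cur]:
                if region != child:          # sibling regions start in their defaults
                    enter_default(region, conf)
        child, cur = cur, parent_of(cur)
    return frozenset(conf)
```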

2.3 Compound Transitions 

A compound transition is a derived semantic concept in UML statecharts. It represents 
a “semantically complete” path made of one or more simple transitions, originating 
from a state configuration and targeting another state configuration. 

We deal with simple transitions in UML statecharts where several special simple 
transitions are restricted: the initial transition initialt and the final transition finalt 
have no trigger and no action, and their guards are true. 

If the target state of a simple transition is a composite state, the compound transition 
associated with the simple transition is composed of the simple transition and one or 
more initial transitions. If the source state of a simple transition is a composite state, 
the compound transition associated with the simple transition is composed of one or more 
final transitions and the simple transition. If a simple transition is neither an initial 
transition (initialt) nor a final transition (finalt), then the simple transition can 
correspond to one or more compound transitions, because the source state of a simple 
transition may have several possible state configurations. 

Compound Transitions. A compound transition ct is a triple (souconf, cl, tarconf) for 
which there exists a simple transition $t \in TRANSITION$ such that 
$souconf \in confset(t.source, tree)$, $tarconf = (souconf \setminus DS) \cup AS$ and 
$cl = t.label$. The greatest departing state of a simple transition t, denoted by gds, is 
a state s such that $t.source \in \rho^*(s, tree)$, $t.target \notin \rho^*(s, tree)$, and 
every such state is a substate of s. The departing states of t, denoted by DS, are defined 
as the set of states $\rho^*(gds, tree)$. The greatest arriving state of a simple 
transition t, denoted by gas, is a state s such that $t.source \notin \rho^*(s, tree)$, 
$t.target \in \rho^*(s, tree)$, and every such state is a substate of s. The arriving 
states of t, denoted by AS, are defined as the set of states 
$\rho^*(gas, tree) \cap defaconf(t.target, tree)$. Except during compound transition 
execution, there is only one active state configuration in a UML statechart. 




For example, the simple transition t7 corresponds to one compound transition ct71 
(refer to Fig. 2). The state S2 maps into the set of state configurations {{S0, S2}} by 
applying the function confset. In this example, t.source = S2, t.target = S3, gds = S2, 
DS = {S2}, gas = S1, $\rho^*(gas, tree)$ = {S1, S3, S5, S6, S7, S8, S9, S10}, 
defaconf(t.target, tree) = {S0, S1, S3, S5, S6, S7, S9}, 
AS = $\rho^*(gas, tree) \cap defaconf(t.target, tree)$ 
= {S1, S3, S5, S6, S7, S8, S9, S10} $\cap$ {S0, S1, S3, S5, S6, S7, S9} 
= {S1, S3, S5, S6, S7, S9}, souconf = {S0, S2}, and 
tarconf = $(souconf \setminus DS) \cup AS$ = ({S0, S2} $\setminus$ {S2}) $\cup$ {S1, S3, S5, S6, S7, S9} 
= {S0, S1, S3, S5, S6, S7, S9}. 
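
The calculation of tarconf can be replayed in a few lines. The sketch below is a hedged illustration: the tree, typing and default substates are assumptions reconstructed from the examples in the text, the helper names (`greatest`, `enter_default`, `parent_of`) are hypothetical, and `greatest` assumes that neither the source nor the target of the transition contains the other.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Set

# Assumed shape of the running example (inferred from the text, not from Fig. 1).
RHO: Dict[str, List[str]] = {
    "S0": ["S1", "S2"], "S1": ["S3", "S4"], "S3": ["S5", "S6"],
    "S5": ["S7", "S8"], "S6": ["S9", "S10"],
    "S2": [], "S4": [], "S7": [], "S8": [], "S9": [], "S10": [],
}
PSI: Dict[str, str] = {s: "SIMPLE" for s in RHO}
PSI.update({"S0": "NCCS", "S1": "NCCS", "S3": "CCS", "S5": "NCCS", "S6": "NCCS"})
DEFAULT: Dict[str, str] = {"S1": "S3", "S5": "S7", "S6": "S9"}  # assumed defaults


def rho_star(s: str) -> FrozenSet[str]:
    return frozenset({s}).union(*(rho_star(sub) for sub in RHO[s]))


def parent_of(s: str) -> Optional[str]:
    return next((p for p, subs in RHO.items() if s in subs), None)


def enter_default(s: str, conf: Set[str]) -> None:
    conf.add(s)
    for sub in ([DEFAULT[s]] if PSI[s] == "NCCS" else RHO[s] if PSI[s] == "CCS" else []):
        enter_default(sub, conf)


def defaconf(st: str) -> FrozenSet[str]:
    conf: Set[str] = set()
    enter_default(st, conf)
    child, cur = st, parent_of(st)
    while cur is not None:
        conf.add(cur)
        if PSI[cur] == "CCS":
            for region in RHO[cur]:
                if region != child:
                    enter_default(region, conf)
        child, cur = cur, parent_of(cur)
    return frozenset(conf)


def greatest(inside: str, outside: str) -> str:
    """Greatest state s with `inside` in rho*(s) and `outside` not in rho*(s)."""
    s = inside
    while parent_of(s) is not None and outside not in rho_star(parent_of(s)):
        s = parent_of(s)
    return s


def tarconf(souconf: Iterable[str], source: str, target: str) -> FrozenSet[str]:
    ds = rho_star(greatest(source, target))                        # departing states DS
    arriving = rho_star(greatest(target, source)) & defaconf(target)  # arriving states AS
    return (frozenset(souconf) - ds) | arriving
```

For t7 (S2 to S3) this gives gds = S2 and gas = S1, and `tarconf({"S0", "S2"}, "S2", "S3")` evaluates to the configuration derived by hand above.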

Enabled Compound Transitions. A compound transition is enabled if and only if: 

• Its source configuration is the active configuration. 

• The trigger of the compound transition is satisfied with the given event. 

• Its guard condition is true. 

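For a configuration conf and an event eve, the enabledness of a compound transition ct 
is captured in the next definition. 

\[
\begin{array}{l}
enabled\,\_ : \mathbb{P}(COMTRAN \times \mathbb{F}_1\Sigma \times EVE) \\
\hline
\exists\, conf : \mathbb{F}_1\Sigma;\ eve : EVE;\ \forall ct : COMTRAN \bullet enabled(ct, conf, eve) \Leftrightarrow \\
\quad (ct.souconf = conf \\
\quad \land\; ct.cl.event = eve \\
\quad \land\; ct.cl.guard)
\end{array}
\]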

Conflicting Compound Transitions. The scope of a compound transition is the 
intersection of its originating and targeting configurations. For example, the scope of 
the compound transition ct32 (refer to Fig. 2) is the set of states {S0, S1, S3, S5, S6, S3} 
and the scope of the compound transition ct54 is the set of states {S0, S1}. Since more 
than one compound transition may be enabled by the same current event, being enabled is 
a necessary but not sufficient condition for the firing of a compound transition. Two 
compound transitions are said to conflict if they both originate from the same 
configuration, are both triggered by the same event, and both guard conditions are true. 
In case of conflicting compound transitions, only one of them will fire in a 
run-to-completion step. For example, the two compound transitions ct32 and ct54 (refer to 
Fig. 2), which originate from the same state configuration {S0, S1, S3, S5, S6, S8, S9}, 
are conflicting if they are both triggered by the same event and both guard conditions 
are true. 

\[
\begin{array}{l}
scope : COMTRAN \rightarrow \mathbb{F}_1\Sigma \\
conflicting : \mathbb{P}(COMTRAN \times COMTRAN \times \mathbb{F}_1\Sigma) \\
\hline
\forall ct : COMTRAN \bullet scope(ct) = ct.souconf \cap ct.tarconf \\
\exists\, conf : \mathbb{F}_1\Sigma;\ \forall ct1, ct2 : COMTRAN \bullet conflicting(ct1, ct2, conf) \Leftrightarrow \\
\quad (ct1 \neq ct2 \land ct1.souconf = conf \land ct2.souconf = conf \\
\quad \land\; ct1.cl.event = ct2.cl.event \land ct1.cl.guard \land ct2.cl.guard)
\end{array}
\]

Firing Priorities. The firing priorities between two conflicting compound transitions 
are determined by comparing their scopes. For two conflicting compound transitions, if 
the scope of one compound transition is a proper subset of the scope of the other, then 
its firing priority is lower than that of the other. For example, the compound transition 
ct32 (refer to Fig. 2) has higher firing priority than the compound transition ct54 if 
they are conflicting. Of two conflicting compound transitions, only the one with the 
higher firing priority will fire. These priorities resolve some of the compound 
transition conflicts, but not all of them. If several conflicting compound transitions 
have the same firing priority, one of them can be chosen to fire in a run-to-completion step. 
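
The enabledness, conflict and priority rules translate directly into set operations. The following sketch is illustrative only: the `CompoundTransition` record, the target configurations chosen for `ct32` and `ct54`, and the shared trigger `e4` are assumptions made for the example, since the labels and targets of these transitions are only given in Fig. 2.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class CompoundTransition:
    name: str
    souconf: FrozenSet[str]
    event: str
    guard: bool
    tarconf: FrozenSet[str]


def enabled(ct: CompoundTransition, conf: FrozenSet[str], eve: str) -> bool:
    return ct.souconf == conf and ct.event == eve and ct.guard


def scope(ct: CompoundTransition) -> FrozenSet[str]:
    return ct.souconf & ct.tarconf


def conflicting(a: CompoundTransition, b: CompoundTransition, conf: FrozenSet[str]) -> bool:
    return (a != b and a.souconf == conf and b.souconf == conf
            and a.event == b.event and a.guard and b.guard)


def highest_priority(cts: List[CompoundTransition], conf: FrozenSet[str],
                     eve: str) -> List[CompoundTransition]:
    """Enabled transitions that no conflicting transition beats on priority."""
    ect = [ct for ct in cts if enabled(ct, conf, eve)]
    return [ct for ct in ect
            if not any(conflicting(ct, other, conf) and scope(ct) < scope(other)
                       for other in ect)]


conf = frozenset({"S0", "S1", "S3", "S5", "S6", "S8", "S9"})
# Hypothetical targets and trigger, chosen only to exercise the rules.
ct32 = CompoundTransition("ct32", conf, "e4", True,
                          frozenset({"S0", "S1", "S3", "S5", "S6", "S8", "S10"}))
ct54 = CompoundTransition("ct54", conf, "e4", True, frozenset({"S0", "S1", "S4"}))
winners = highest_priority([ct32, ct54], conf, "e4")
```

Here the scope of `ct54`, {S0, S1}, is a proper subset of the scope of `ct32`, so only `ct32` survives the filter. Transitions with incomparable scopes would both remain, matching the remark that the priorities do not resolve every conflict.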

2.4 Definition of the Run-to-Completion Step 

Status of a Statechart. The run-to-completion steps of UML statecharts, STEP, are 
sequences of statuses. A status consists of three components: active state 
configuration, current event and event queue. 

actconf: the state configuration in which the system currently resides; 
curevent: the current event, which was dequeued and dispatched in the previous step; 
evequeue: an event queue that holds incoming event instances until they are 
dispatched. 

The schema STATUS defines these objects. 




Initial Status. We now define the special configuration $conf_0$ that contains the 
initial state. The only active state configuration in the first status of UML statecharts 
STEP is the initial configuration $conf_0$, which contains tree.top and the initial state 
($conf_0 = \{tree.top, initial\}$). The current event and the event queue are empty. We 
define an event constant, the empty event empevent, before the schema for the initial 
status, INITSTATUS, is presented as follows. 




Definition of the Run-to-Completion Step. The run-to-completion step is the 
passage between two state configurations of the UML statechart [1]. We postulate that 
the event queue is a first-in first-out queue. Events are generated as a result of some 
action either within the system or in the environment surrounding the system. The 
events are added to the event queue evequeue. 

The semantics of event processing is based on the run-to-completion assumption, 
interpreted as run-to-completion processing. Run-to-completion processing means 
that an event can only be dequeued and dispatched if the processing of the previous 
current event is fully completed. 

The following list explains the step and how it relates to our definition in the 
Z schema STEP. 

• Compute the set of enabled compound transitions (corresponds to the set ECT). 

• Remove from this set all compound transitions that are in conflict with an enabled 
compound transition of higher priority (corresponds to the set ETHP). 

• If there are no enabled compound transitions then the step is empty; otherwise 
choose one compound transition nondeterministically for execution. Let ct be the 
choice. The action of the compound transition ct is executed. 

• The events ct.cl.action that were generated by the action of the executed compound 
transition are appended to the back of the event queue evequeue. If the event queue 
evequeue is not empty, its front event is removed and becomes the current event 
curevent in the next step. 

(This corresponds to the assignments to the variables actconf’, curevent’ and 
evequeue’.) 
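
The four steps listed above can be sketched as a single function. This is a minimal illustration under stated assumptions and not the Z schema STEP itself: the records `CT` and `Status`, the `choose` parameter (which stands in for the nondeterministic choice and defaults to picking the first candidate), and the treatment of an empty step (configuration unchanged, next event still dequeued) are all assumptions. The example transition follows the text's t1, labelled e1/e2, from S7 to S8.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, FrozenSet, List, Optional, Tuple

Conf = FrozenSet[str]


@dataclass(frozen=True)
class CT:
    souconf: Conf
    event: str
    guard: bool
    action: Tuple[str, ...]   # events generated by the action
    tarconf: Conf


@dataclass
class Status:
    actconf: Conf
    curevent: Optional[str]   # None plays the role of the empty event
    evequeue: Deque[str]


def step(status: Status, cts: List[CT],
         choose: Callable[[List[CT]], CT] = lambda candidates: candidates[0]) -> Status:
    # 1. enabled compound transitions (ECT)
    ect = [ct for ct in cts
           if ct.souconf == status.actconf and ct.event == status.curevent and ct.guard]
    # 2. drop those beaten by a conflicting transition with a larger scope (ETHP)
    ethp = [ct for ct in ect
            if not any((ct.souconf & ct.tarconf) < (o.souconf & o.tarconf) for o in ect)]
    queue = deque(status.evequeue)
    actconf = status.actconf
    # 3. execute one chosen transition, or leave the configuration unchanged
    if ethp:
        ct = choose(ethp)
        actconf = ct.tarconf
        queue.extend(ct.action)          # 4. generated events go to the back
    curevent = queue.popleft() if queue else None
    return Status(actconf, curevent, queue)


C79 = frozenset({"S0", "S1", "S3", "S5", "S6", "S7", "S9"})
C89 = frozenset({"S0", "S1", "S3", "S5", "S6", "S8", "S9"})
ct11 = CT(C79, "e1", True, ("e2",), C89)
s1 = step(Status(C79, "e1", deque()), [ct11])
s2 = step(s1, [ct11])
```

After the first call the configuration contains S8 and the generated event e2 has become the current event; the second call finds nothing enabled, so the configuration is unchanged and the current event becomes empty.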
3 FREE State Model 

3.1 Definition of FREE State Model 

In the Mealy model, both event and action are associated with a transition, and 
states are static. In the Moore model, by contrast, the event is associated with a 
transition and the action is associated with a state, which is therefore not static. 
Both the Mealy model and the Moore model can be expressed in UML statecharts. This 
mixture easily leads to errors and low efficiency, and it is forbidden in the FREE 
model. In Section 2, we consider UML statecharts that adopt only the Mealy model, 
which seems to reduce the expressiveness. In fact, however, the actions in states can 
be transformed into actions of self-transitions, so the information of the Moore model 
can still be expressed. It is difficult to generate test cases for a class directly from 
UML statechart diagrams that contain hierarchical and concurrent structure. 
According to our precise semantics, a UML statechart diagram can be transformed 
into a FREE state model [6]. The transformation generates a FREE state model in which 
the hierarchical and concurrent structure of states in the UML statechart diagram is 
flattened. A UML model that follows the FREE conventions will be testable [6]. 

FREE State Model. A FREE state model is an extended FSM, a tuple 
<CONF, $conf_0$, CT> such that 

• CONF is a set of configurations. 

• $conf_0 \in CONF$ is the initial configuration. 

• CT is a set of compound transitions. 

The set of configurations CONF1 is the union of the sets of configurations obtained by 
applying the function confset (refer to Section 2.2) to each of the states of the UML 
statechart diagram. In Section 2.4 we defined the initial configuration $conf_0$ that 
contains tree.top and the initial state ($conf_0 = \{tree.top, initial\}$). The set of 
configurations CONF is the union of the set of configurations CONF1 and the initial 
configuration $conf_0$. The set of compound transitions CT is calculated by applying 
schema COMTRAN (refer to Section 2.3). 

3.2 Example of FREE State Model 

From a given UML statechart diagram, we can construct a FREE state model which is 
equivalent to the UML statechart. In our example, there are seven configurations in 
Fig. 1 and these constitute CONF of FREE state model in Fig. 2. 

In Fig. 1, six configurations are calculated by applying the function confset. These six 
configurations and the initial configuration $conf_0$ make up the set of configurations 
CONF. In Fig. 1, there are fifteen compound transitions calculated by applying 
schema COMTRAN. The fifteen compound transitions and the initial compound 
transition make up the set of compound transitions CT. 

We give two examples of how compound transitions are calculated by applying 
schema COMTRAN: 

The simple transition t1 in Fig. 1 corresponds to two compound transitions ct11 and 
ct12 in Fig. 2 because there are two state configurations in state S7, the source state of 
transition t1. The state S7 maps into the set of state configurations 
{{S0, S1, S3, S5, S6, S7, S9}, {S0, S1, S3, S5, S6, S7, S10}} by applying the function 
confset. In this example, gds = S7, DS = {S7}, gas = S8, defaconf(S8, tree) = {S8}, and 
AS = $\rho^*(gas, tree) \cap defaconf(t.target, tree)$ = {S8}. If 
souconf = {S0, S1, S3, S5, S6, S7, S9}, then 
tarconf = $(souconf \setminus DS) \cup AS$ = ({S0, S1, S3, S5, S6, S7, S9} $\setminus$ {S7}) $\cup$ {S8} 
= {S0, S1, S3, S5, S6, S8, S9}; if souconf = {S0, S1, S3, S5, S6, S7, S10}, then 
tarconf = $(souconf \setminus DS) \cup AS$ = ({S0, S1, S3, S5, S6, S7, S10} $\setminus$ {S7}) $\cup$ {S8} 
= {S0, S1, S3, S5, S6, S8, S10}. 

As another example, the simple transition t6 in Fig. 1 corresponds to two compound 
transitions ct61 and ct62 in Fig. 2 because there are two state configurations in state 
S9, the source state of transition t6. The state S9 maps into the set of state 
configurations {{S0, S1, S3, S5, S6, S7, S9}, {S0, S1, S3, S5, S6, S8, S9}} by applying the 
function confset. In this example, gds = S1, DS = {S1, S3, S4, S5, S6, S7, S8, S9, S10}, 
defaconf(S2, tree) = {S2}, gas = S2, and 
AS = $\rho^*(gas, tree) \cap defaconf(t.target, tree)$ = {S2}. If 
souconf = {S0, S1, S3, S5, S6, S7, S9}, then tarconf = $(souconf \setminus DS) \cup AS$ 
= ({S0, S1, S3, S5, S6, S7, S9} $\setminus$ {S1, S3, S4, S5, S6, S7, S8, S9, S10}) $\cup$ {S2} 
= {S0, S2}; if souconf = {S0, S1, S3, S5, S6, S8, S9}, then 
tarconf = $(souconf \setminus DS) \cup AS$ 
= ({S0, S1, S3, S5, S6, S8, S9} $\setminus$ {S1, S3, S4, S5, S6, S7, S8, S9, S10}) $\cup$ {S2} 
= {S0, S2}. 




Fig. 2. An example of FREE state model 



3.3 Computation of FREE Models 

The above FREE state model abstracts from the specific environment the UML 
statechart is interacting with. The events sent by the environment and by other objects 
are saved in the event queue evequeue (refer to Section 2.4). We use the notation 
$s_i \xrightarrow{STEP} s_{i+1}$ to denote that the pair of statuses $(s_i, s_{i+1})$ is 
in the relation STEP. 

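Definition 3.1 (Computation). A computation for a FREE model is an infinite or finite 
sequence of statuses $s_i \in STATUS$ ($i \in \mathbb{N}$), such that: 

\[
\begin{array}{l}
\forall i : \mathbb{N} \bullet (\exists\, s : STATUS \bullet (s.curevent = head\ s_i.evequeue \\
\qquad \land\; s_i.actconf = s.actconf \\
\qquad \land\; s \xrightarrow{STEP} s_{i+1})) \\
\land\; \exists\, k : \mathbb{N} \bullet (i < k \Rightarrow \exists\, s : STATUS \bullet \\
\qquad (s.curevent = head\ s_i.evequeue \\
\qquad \land\; s_i.actconf = s.actconf \\
\qquad \land\; s \xrightarrow{STEP} s_{i+1})) \\
\land\; (i = k \Rightarrow (\exists\, s : STATUS \bullet \\
\qquad (s.curevent = head\ s_{i+1}.evequeue \\
\qquad \land\; s_{i+1}.actconf = s.actconf \\
\qquad \land\; s_i \xrightarrow{STEP} s_{i+1})))
\end{array}
\]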



In the above definition, the first case of computation is infinite; the second case of 
computation is finite (with maximum $k \in \mathbb{N}$). This definition captures the 
interplay with the environment: environment steps, which possibly provide new events, 
alternate with steps of the system. This definition records both internally generated 
events and those events provided by the environment. In order to illustrate the 
computations of the FREE state model, we let the labels of t1, t2, t3, t4, t5, t6 and t7 
in Fig. 1 be e1/e2, e3/e4, e2/e4, e4/e5, e4/, e5/e3 and e3/e4, respectively. The event 
queue evequeue is initialized with the event e1. An example of the interaction of the 
FREE state model of Fig. 2 is given in Fig. 3. 




Fig. 3. An Example of the Interaction of FREE State Model 



Obviously, computations are mathematical abstractions of test runs. This is highly 
advantageous for the automatic derivation of test cases from UML statecharts. 

4 Conclusions and Further Work 

We give a formal definition of the semantics of the UML statecharts described in [1]. 
This paper provides a method of formalizing the semantics of UML statecharts with Z and 
extends the definition of firing priorities between two conflicting compound transitions 
based on [1]. A simple transition in a UML statechart corresponds to one or more 
compound transitions in the FREE state model if the simple transition is neither an 
initial transition nor a final transition. According to this precise semantics, a UML 
statechart can be transformed into a FREE state model. The hierarchical and concurrent 
structure of states is flattened in the resulting FREE state model. The model helps to 
determine whether the software design is consistent, unambiguous and complete. The 
work presented in this paper is beneficial to the development of methods for the 
automatic derivation of test cases from UML statechart models. The study and 
development of such methods is a first item for further research. When applying UML 
statecharts to class testing, it suffices to consider the FREE state model. 

Abstract. Being part of domain engineering, domain analysis enables identifying 
domains and capturing their ontologies in order to assist and guide system 
developers to design domain-specific applications. Several studies suggest us- 
ing metamodeling techniques for modeling domains and their constraints. How- 
ever, these techniques use different notions, and sometimes even different nota- 
tions, for defining domains and their constraints and for specifying and 
designing the domain-specific applications. We propose an Application-based 
DOmain Modeling (ADOM) approach in which domains are treated as regular 
applications that need to be modeled before systems of those domains are speci- 
fied and designed. This way, the domain models enforce static and dynamic 
constraints on their application models. The ADOM approach consists of three
layers and defines dependency and enforcement relations between these layers.
In this paper we describe the ADOM architecture and validation rules focusing 
on applying them to UML static views, i.e., class, component, and deployment 
diagrams. 



1 Introduction 

Domain Engineering is a software engineering discipline concerned with building 
reusable assets and components in a specific domain [4], [5], [6]. We refer to a do- 
main as a set of applications that use a common jargon for describing the concepts 
and problems in that domain. The purpose of domain engineering is to identify, 
model, construct, catalog, and disseminate a set of software artifacts that can be ap- 
plied to existing and future software in a particular application domain [21]. As such, 
it is an important type of software reuse, verification, and validation [15]. 

Similarly to software engineering, domain engineering includes three main activi- 
ties: domain analysis, domain design, and domain implementation. Domain analysis 
identifies a domain and captures its ontology [26]. Hence, it should specify the basic 
elements of the domain, organize an understanding of the relationships among these 
elements, and represent this understanding in a useful way [4]. Domain design and 
domain implementation are concerned with mechanisms for translating requirements 
into systems that are made up of components with the intent of reusing them to the 
highest extent possible. 

Domain analysis is especially crucial for two main reasons. First, analysis
is one of the initial steps of the system development lifecycle. Avoiding syntactic and 
semantic mistakes at this stage (using domain analysis principles) helps to reduce 
development time and to improve product quality and reusability. Secondly, the core
elements of a domain and the relations among them usually remain unchanged, while
the technologies and implementation environments keep improving.
Hence, domain analysis models usually remain valid for long periods.

Several methods and architectures have been developed to support domain analy- 
sis. Some of them rely on Unified Modeling Language (UML) [3] and metamodeling 
techniques [27], for example Catalysis [11]. Using standard notations and techniques 
has many advantages, including accessibility, reliability, and uniformity. However, 
most of the suggested approaches to domain analysis use different notions, and sometimes
even different notations, for defining domains and their constraints and for specifying
and designing applications, weakening the mentioned standardization benefits. Other
techniques (e.g., [7], [25]) use UML extension mechanisms, more precisely stereotypes.
Yet, this mechanism provides no formal definition of domain models.

In this paper we present the Application-based DOmain Modeling (ADOM) ap- 
proach, which enables modeling domains as if they were regular applications. This 
approach enables the validation of domain-specific application models against their 
domain models. The ADOM approach consists of three layers: the application layer, 
the domain layer, and the (modeling) language layer. In the application layer, the 
required application is modeled as composed of classes, associations, collaborations, 
etc. In the domain layer, the domain elements and relations are modeled as if the do- 
main itself is an application. Finally, the language layer includes metamodels of mod- 
eling languages (or methods). We also provide a set of validation rules between the 
different layers: the domain layer enforces constraints on the application layer, while 
the language layer enforces constraints on both the application and domain layers. 
Thus, the contribution of this paper is twofold. First, we provide an approach for 
modeling various aspects of domains and for validating application models against 
domain models. This approach uses a single, standard modeling language, UML, and 
a standard technique, metamodeling. Secondly, applying the ADOM approach to 
UML, we provide a formal framework for defining and constraining stereotypes. 

The structure of the rest of the paper is as follows. Section 2 reviews existing 
works within the domain analysis area, dividing them into single-level and two-level 
domain analysis approaches. Section 3 introduces our three-level ADOM approach. In 
this section, we elaborate on applying ADOM to UML class, component, and de- 
ployment diagrams, exemplifying the approach stages and validation rules on a do- 
main of Web applications and a Web-based glossary application. Finally, Section 4 
summarizes the strengths of this approach and refers to future research plans. 



2 Domain Analysis - Literature Review 

Referring to domain analysis as an engineering approach, Argano [1] suggested that 
domain analysis should consist of conceptual analysis combined with infrastructure 
specification and implementation. Meekel et al. [15] suggested that in addition to its 
static definition, domain analysis may be conceived of as a development process 
which identifies a domain scope, builds a domain model, and validates that model. 
Since the domain keeps evolving as the product users within its scope generate new 
requirements, domain analysis is not a one-shot affair [5], [6]. Gomaa and Kerschberg
[13] agreed that the domain model lifecycle is constantly evolving via an iterative 
process. Supporting this domain evolution concept, Drake and Ett [10] claimed that 
domain analysis gives rise to two concurrent, mutually dependent lifecycles that
should be correlated: the fundamental system lifecycle and the domain lifecycle. 
Becker and Diaz-Herrera [2] proposed that the two concurrent streams are the design 
for reuse (i.e., the domain model) and the design with reuse (i.e., the application 
model). Following this spirit, the Model-Driven Architecture (MDA) [19], which 
originally aimed to separate business or application logic from underlying platform 
technology, observes that system functionality will gradually become more knowl- 
edge-based and capable of automatically discovering common properties of dissimilar 
domains. In other words, the aim of MDA is to eventually build systems in which
a considerable amount of domain knowledge is pushed up into higher abstraction levels.
However, this vision is supported at a conceptual level and not (yet) at a practical
one.

Several methods and techniques have been developed to support domain analysis. 
We classify them into two categories: single-level and two-level domain analysis 
approaches. 

2.1 Single-Level Domain Analysis Approaches 

In the single-level domain analysis approaches, the domain engineer defines domain
components, libraries, or architectures. The application designer reuses these domain 
artifacts and can change them in the application model. Meekel et al. [15], for exam- 
ple, propose a domain analysis process that is based on multiple views. They used 
Object Modeling Technique (OMT) [24] to produce a domain-specific framework and 
components. Gomaa and Kerschberg [13] suggest that a system specification will be 
derived by tailoring the domain model according to the features desired in the specific 
system. 

Feature-Oriented Domain Analysis (FODA) [14] defines several activities to sup- 
port domain analysis, including context definition, domain characterization, data 
analysis and modeling, and reusable architecture definition. A specific system makes 
use of the reusable architecture but not of the domain model. 

Clauss [7] suggests two stereotypes for maintaining variability within a domain 
model: «variation point», which indicates the variability of an element, and 
«variant», which indicates the extension part. These stereotypes seem to be weak
when defining a domain model and validating a specific application model of that 
domain. 

Catalysis [11] is an approach to systematic business-driven development of com- 
ponent-based systems. It defines a process to help business users and software devel- 
opers share a clear and precise vocabulary, design and specify component interfaces 
so they plug together readily, and reuse domain models, architectures, interfaces, 
code, etc. Catalysis introduced two types of mechanisms for separating different sub- 
ject areas: package extension and package template. Package extension allows defini- 
tions of fragments of language to be developed separately and then merged to form 
complete languages. Package templates, on the other hand, allow patterns of language 
definition to be distilled and then applied consistently across the definition of lan- 
guages and their components. Both package extension and package template mecha- 
nisms deal basically with classes and packages and enable renaming of the structural 
elements when reusing them in particular systems. In addition, that work does not 
address the application model validation against its domain model(s). 

2.2 Two-Level Domain Analysis Approaches

In the two-level domain analysis approaches, a connection is made between the domain
model and its usage in the application model. Contrary to the single-level domain 
analysis approaches, the domain and application models in the two-level domain 
analysis approaches remain separate, while validation rules between them are defined. 
These validation rules enable avoiding syntactic and semantic mistakes during the 
initial stages of the application modeling, reducing development time and improving 
system quality. Petro et al. [20], for example, present a concept of building reusable 
repositories and architectures, which consist of correlated component classes, connec- 
tions, constraints, and rationales. When modeling a specific system, the system model 
is validated with respect to the domain model in order to check that no constraint has 
been violated. 

Schleicher and Westfechtel [25] discuss static metamodeling techniques in order to 
define domain specific extensions. They divide these extensions into descriptive 
stereotypes for expressing the elements of the underlying domain metamodel, restric- 
tive stereotypes for attaching constraints to stereotyped model elements, regular 
metamodel extensions, and restrictive metamodel extensions. They mostly deal with 
packages and classes, but not with behavioral elements. Furthermore, the semantics 
and constraints of the stereotypes used in this work are expressed in a natural lan- 
guage, weakening the formality of this approach. 

Gomaa and Eonsuk-Shin [12] suggest a multiple view metamodeling method for
software product lines. They solve model commonality and variability problems
within the product line domain by defining special stereotypes which are used in the
use case, class, collaboration, statechart, and feature model views. These stereotypes
are modeled at the metamodel level by class diagrams, while the relations among
them are specified in Object Constraint Language (OCL) [28]. The main shortcoming
of this method is in using a new dialect of UML for modeling the domain elements
and constraints (e.g., adding alternating paths).

Morisio et al. [16] propose an extension to UML that includes a special stereotype 
indicating that a class may be altered within a specific system. The extension is dem- 
onstrated by applying it to UML class diagrams. The validation of an application 
model with respect to its domain model entails checking whether a class appears in 
the application model along with its associated classes, but not whether the class is
correctly connected.

The Institute for Software Integrated Systems (ISIS) at Vanderbilt University sug- 
gested a metamodeling technique for building a domain-specific model using UML 
and OCL [17]. The application models are created from the domain metamodel, enabling
validation of their consistency and integrity in terms of the domain analysis [9].
However, the domain models are specified using UML class diagrams and OCL,
while the application models use other notations, forgoing the benefits of applying a
standard modeling language to the application models as well.



3 The Application-Based Domain Modeling (ADOM) Approach 

Application models and domain models are similar in many aspects. An application 
model consists of classes and associations among them and it specifies a set of possi- 
ble behaviors. Similarly, a domain model consists of core elements, static constraints, 
and dynamic relations. The main difference between these models is in their abstraction
levels, i.e., domain models are more abstract than application models. Furthermore,
domain models should be flexible in order to handle commonalities and differences
of the applications within the domain.

The classical framework for metamodeling is based on an architecture with four 
abstraction layers [18]. The first layer is the information layer, which is comprised of 
the desired data. The model layer, which is the second layer, is comprised of the 
metadata that describes data in the information layer. The third metamodel layer is 
comprised of the descriptions that define the structure and semantics of metadata. 
Finally, the meta-metamodel layer consists of a description of the structure and se- 
mantics of meta-metadata (for example, metaclasses, metaattributes, etc.). Following 
this general architecture, we divide our Application-based DOmain Modeling 
(ADOM) approach into three layers: the application layer, the domain layer, and the 
(modeling) language layer. The application layer, which is equivalent to the model 
layer (M1), consists of models of particular systems, including their structure
(scheme) and behavior. The domain layer, i.e., the metamodel layer (M2), consists of 
specifications of various domains. The language layer, which is equivalent to the 
meta-metamodel layer (M3), includes metamodels of modeling languages. The mod- 
eling languages may be graphical, textual, mathematical, etc. In addition, the ADOM 
approach explicitly enforces constraints among the different layers: the domain layer 
enforces constraints on the application layer, while the language layer enforces con- 
straints on both the application and domain layers. 

Figure 1 depicts the architecture of the ADOM approach. The application layer in 
this figure includes three examples of applications: Amazon, which is a Web-based 
book store, eBay, which is an auction site supported by agents, and Kasbah, which is 
a multi-agent electronic marketplace. Each one of these systems may have several 
models in different modeling languages. The domain layer in Figure 1 includes two 
domains: Web applications and multi-agent systems, while the language layer in this
example includes only one modeling language, UML. Since UML is the current stan- 
dard (object-oriented) modeling language, we apply the ADOM approach to UML. 

Figure 1 also shows the relations between the layers. The black arrows indicate
constraint enforcement of the domain models on the application models, while the 
grey arrows indicate constraint enforcement of the language metamodels on the appli- 
cation and domain models. 

The rest of this section elaborates on the domain and application layers, while the 
language layer is restricted to the UML metamodel [3] except for two minor changes:

1. A model element (e.g., attribute, operation, message, etc.) has an additional
feature, called "multiplicity", which represents how many times the model element
can appear in a particular system. This feature appears as «min..max» before a 
relevant domain element in a domain model, while «1..1» is the default (and, 
hence, does not appear). 

2. A model element can have several stereotypes, which are separated by commas. 
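
The «min..max» prefix from the first change can be read mechanically. The following Python sketch is only an illustration and not part of the ADOM definition: the names Multiplicity and parse_feature are hypothetical, and the sketch assumes that an unbounded upper limit is written as m (as in «0..m») and that a declaration without a prefix carries the default «1..1».

```python
import re
from typing import NamedTuple, Optional, Tuple


class Multiplicity(NamedTuple):
    """How many times a domain element may appear in an application model."""
    low: int
    high: Optional[int]  # None stands for the unbounded upper limit "m"

    def allows(self, count: int) -> bool:
        return count >= self.low and (self.high is None or count <= self.high)


# Assumed textual form of the prefix: «min..max», where max is a number or "m".
_PREFIX = re.compile(r"^\s*«\s*(\d+)\s*\.\.\s*(\d+|m)\s*»\s*(.*)$")


def parse_feature(declaration: str) -> Tuple[Multiplicity, str]:
    """Split a domain-model declaration into (multiplicity, remaining text)."""
    match = _PREFIX.match(declaration)
    if match is None:
        # No prefix: the default multiplicity «1..1» applies.
        return Multiplicity(1, 1), declaration.strip()
    low, high, rest = match.groups()
    upper = None if high == "m" else int(high)
    return Multiplicity(int(low), upper), rest.strip()
```

For example, parse_feature("«0..m» variable: anyType") yields an unbounded multiplicity starting at 0 together with the text "variable: anyType".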

3.1 Applying UML Structural Views to the Domain Layer 
of the ADOM Approach 

Fig. 1. The Application-based DOmain Modeling (ADOM) architecture

When referring to the static views of a domain, the domain engineer can use UML
class, component, and deployment diagrams for specifying the domain elements and
the (structural) constraints among them. In what follows, we demonstrate the ADOM
approach on a part of the Web application domain as defined by Conallen [8].
Figure 2 cites a definition of a server page given by Conallen.



1. A server page represents a web page that has scripts which are executed by the server.

2. These scripts interact with resources on the server (databases, business logic, external 
systems, etc). 

3. The object’s operations represent the functions in the script, and its attributes represent 
the variables that are visible in the page’s scope (accessible by all functions in the page). 

4. Server pages can only have relationships with objects on the server. 

Fig. 2. A part of Conallen's specification for the Web application domain - A Server Page 
definition 

As can be seen, the definition in Figure 2 includes logical and physical elements 
(classes, components, and nodes). Hence, modeling this particular domain element, a 
server page, requires UML class, component, and deployment diagrams. 

Figure 3 is a partial class diagram that models the logical aspects of a server page: 
A server page is specified as a class the attributes of which are classified as vari- 
ables. A server page may have any number (including 0) of variables which can be 
of any type recognized in UML. These constraints are modeled in the domain model 
as the attribute "«0..m» variable: anyType" of the server page class. Since these 
variables are visible only within the server page's scope (including its scripts), their 
scope is defined to be "package" in the domain model. The order of scopes (from the 
least restricted to the most restricted) is public, package, protected, and private. A 
scope of a model element defined in a domain model is the least restricted scope that 
this element can get in any application model of that domain.¹ In particular, a variable
scope within an application model can be package, protected, or private. 
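
The scope rule can be illustrated with a few lines of Python. This is a minimal sketch under the ordering stated above; the names SCOPE_ORDER and scope_allowed are hypothetical and do not come from the ADOM definition.

```python
# Scopes ordered from the least restricted to the most restricted,
# following the order given in the text.
SCOPE_ORDER = ["public", "package", "protected", "private"]


def scope_allowed(domain_scope: str, application_scope: str) -> bool:
    """True if the application scope is at least as restricted as the domain scope."""
    return SCOPE_ORDER.index(application_scope) >= SCOPE_ORDER.index(domain_scope)
```

With the server page variable declared as package in the domain model, scope_allowed("package", "private") holds, whereas scope_allowed("package", "public") does not.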



¹ Enforcing a specific scope on a model element (e.g., public) can be done by defining an OCL
constraint.

Fig. 3. A partial class diagram of a Server Page within the Web application domain

Fig. 4. A partial merged component and deployment diagram describing the physical constraints
on a Server Page within the Web application domain

Figure 3 also specifies that a server page may have any number of operations,
regardless of their signatures, as indicated by the "«0..m» anyMethod («0..m»
anyParameter: anyType): anyType" declaration. All the operations of a server page (as
all the operations in this domain model) are defined as public in the domain model
and, hence, their scopes are not limited in the application models, i.e., they can be
public, package, protected, or private. A server page may have relations with any
class (on the server, as will be constrained next), as indicated by the association
between server page and anyClass. In addition, a server page may aggregate any
number of scripts. A script has any number of operations regardless of their
signatures, may have any relations with other scripts, as indicated by the self-association
labeled anyRelation, and interacts with any number of resources (on the server), as
indicated by the dependency relation labeled "interacts with".

Similarly to scopes, the ADOM approach defines a precedence order between rela- 
tions. The most general relation is an association, followed by a navigational associa- 
tion, an aggregation, a navigational aggregation, a composition, and a navigational 
composition. A relation specified in a domain model is the most general relation
possible between the two model elements in any application model of that domain.
Enforcing a specific relation type (e.g., aggregation) requires defining a new type of
OCL constraint.
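
The precedence order between relations can be encoded as an ordered enumeration. The sketch below is an illustration only: the names Relation and relation_allowed are hypothetical, and it assumes that the order listed above is a strict linear order in which every later relation counts as more specific than every earlier one.

```python
from enum import IntEnum


class Relation(IntEnum):
    """Relation types ordered from the most general to the most specific."""
    ASSOCIATION = 0
    NAVIGATIONAL_ASSOCIATION = 1
    AGGREGATION = 2
    NAVIGATIONAL_AGGREGATION = 3
    COMPOSITION = 4
    NAVIGATIONAL_COMPOSITION = 5


def relation_allowed(domain: Relation, application: Relation) -> bool:
    """True if the application relation is at least as specific as the domain relation."""
    return application >= domain
```

Under this reading, a domain-level association may be realized as a navigational aggregation in an application model, but a domain-level navigational aggregation may not be weakened to a plain association.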

Figure 3 does not limit the structure of a resource element, i.e., it may have any 
attributes, any operations, and any relations with other resources. However, this 
figure defines the hierarchy of resources: a resource is specialized into database, 
business logic, and external system, each of which is a special type of resources. 

Figure 4 presents a component diagram merged into a deployment diagram. The 
merged diagram expresses the physical constraints of the domain on server pages. The 
main domain node is a server, of which at least one physical node must exist, as
indicated by the multiplicity feature («1..m»). The server hosts at least one resource
component and at least one server page component. It may also host components 
of any type each of which implements at least one class (of any type). A resource 
component implements at least one resource class and may implement any number 
of other resource classes, i.e., business logic, database, and/or external system. A 
server page component implements at least one server page class and any number 
(including 0) of script classes. Figure 4 also defines dependency constraints among 
components: a server page component depends on at least one resource compo- 
nent and may depend on other components of any type, including other server page 
components. 



3.2 Applying UML Structural Views to the Application Layer 
of the ADOM Approach 

An application model uses a domain model as a validation template. All the con- 
straints enforced by the domain model should be applied to any application model of 
that domain. To achieve this goal, every element in the application model is
classified according to the elements declared in the domain model using the UML
stereotype mechanism. As defined in the UML User Guide [3], a stereotype is a kind of model
element whose information content and form are the same as the basic model element,
but whose meaning and usage are different. The ADOM approach requires that a model
element in an application model preserve the relations of its stereotypes in the
relevant domain model(s).

Returning to our example of the Web application domain, we describe in this sec- 
tion a partial model of a Web-based glossary application (GLAP) in that domain. The 
GLAP system [8] provides an online version of a software development project’s 
glossary of terms. The project’s team members can access the database of terms
using a common Web browser. Team members may also update entries, add entries to the
database, and remove entries from it, using the same browser interface. Figure 5 is a
partial class diagram of the GLAP system. Following the server page definition in the
Web application domain, shown in Figure 3, the GLAP system defines two types of 
server pages: process search, which uses the glossary API to search the glossary for 
words (or descriptions) that match a string, and edit entry, which builds an edit page 
for a specific entry in the glossary. Process search consists of writeEntry (classified 
as a script) and getEntries (also classified as a script). It also has four variables (at- 
tributes): searchWord, searchDescription, nl (the new line string), and message- 
Word (the word searched for, modified for use as a hyperlink parameter). All the 
variables of process search are of type string. The Edit entry server page consists of 
getEntry (classified as a script) and has three variables (id, word, and description). 
The getEntries script consists of a getEntry script. The writeEntry and getEntries 
scripts interact with the glossary DB (classified as a database), which in turn consists 
of many glossary entries (classified as "database" elements). 




Fig. 5. A partial class diagram of the GLAP system - A description of process search and edit 
entry server pages 



The ADOM approach validates the structure of each application class and the rela- 
tions among them using the domain model. Table 1 summarizes the domain con- 
straints of the Web application elements, and how these are correctly fulfilled in the 
class diagram of the GLAP system. For each domain class, the table lists its features 
(in the "Feature Name" column), scope or relation type constraints (in the "Feature 
Constraint" column), and multiplicity constraints (in the "Allowed Feature Multiplicity"
column). In addition, the table summarizes the actual features of each class in the
application model (in the "Actual Features" column). As can be seen, none of the 
constraints expressed in the domain model, shown in Figure 3, are violated by the 
application model, specified in Figure 5. 
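
The kind of check summarized in Table 1 can be sketched in Python for the two server pages of the GLAP system. The sketch is a simplified illustration, not the authors' validation procedure: the class and function names are hypothetical, only the "variable" and "general operation" rows of the server page are covered, and the data is transcribed from the description of Figure 5 and from Table 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

SCOPES = ["public", "package", "protected", "private"]  # least to most restricted


@dataclass
class DomainFeature:
    name: str
    max_scope: str               # least restricted scope an application may use
    low: int = 0                 # minimum number of occurrences
    high: Optional[int] = None   # None means unbounded


@dataclass
class AppClass:
    name: str
    stereotype: str
    # domain feature name -> scopes of the matching application features
    features: Dict[str, List[str]] = field(default_factory=dict)


def validate(app_class: AppClass, domain_features: List[DomainFeature]) -> List[str]:
    """Return a list of violated domain constraints (empty if the class is valid)."""
    errors = []
    for feat in domain_features:
        scopes = app_class.features.get(feat.name, [])
        count = len(scopes)
        if count < feat.low or (feat.high is not None and count > feat.high):
            errors.append(f"{app_class.name}: {count} x {feat.name} violates multiplicity")
        for scope in scopes:
            if SCOPES.index(scope) < SCOPES.index(feat.max_scope):
                errors.append(f"{app_class.name}: {feat.name} scope {scope} "
                              f"is less restricted than {feat.max_scope}")
    return errors


# Domain constraints of a server page (first two rows of Table 1).
SERVER_PAGE = [
    DomainFeature("variable", "package"),
    DomainFeature("general operation", "public"),
]

# Application classes stereotyped as server pages.
process_search = AppClass("process search", "server page",
                          {"variable": ["package"] * 4, "general operation": ["public"]})
edit_entry = AppClass("edit entry", "server page",
                      {"variable": ["package"] * 3, "general operation": ["public"]})
```

Both validate(process_search, SERVER_PAGE) and validate(edit_entry, SERVER_PAGE) return an empty list, while a server page with a public variable would be reported as a scope violation.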

Figure 6 depicts the implementation view of the GLAP system. This diagram fol- 
lows the guidelines of the Web application domain for components and their deploy- 
ment as expressed in Figure 4. The ADOM approach validates the existence of the 
defined classes and their associations to components and nodes. It also validates the 
relationships among the various model elements and their multiplicities. 
Table 1. The Web application domain constraints and their fulfillment in the GLAP system
model - comparing the class diagrams

Class       | Feature Name           | Feature Constraint             | Allowed Feature Multiplicity | Actual Features
------------+------------------------+--------------------------------+------------------------------+-----------------------------------------
Server page | variable               | Max scope: package             | 0..∞                         | 4 package variables for process search
            |                        |                                |                              | 3 package variables for edit entry
            | general operation      | Max scope: public              | 0..∞                         | 1 public operation for each server page
            |                        |                                |                              | (process search & edit entry)
            | relation to script     | Type: navigational aggregation | 0..∞                         | 2 navigational aggregations for process
            |                        |                                |                              | search (writeEntry and getEntries)
            |                        |                                |                              | 1 navigational aggregation for edit
            |                        |                                |                              | entry (getEntry)
            | relation to any class  | Type: association              | 0..∞                         | 0 relations to other classes for both
            |                        |                                |                              | process search & edit entry
------------+------------------------+--------------------------------+------------------------------+-----------------------------------------
script      | general attribute      | None                           | 0                            | 0 attributes for both process search
            |                        |                                |                              | and edit entry
            | general operation      | Max scope: public              | 0..∞                         | 0 public operations for each script
            | relation to script     | Type: association              | 0..∞                         | 1 navigational aggregation for getEntries
            |                        |                                |                              | 0 relations for the other scripts
            | dependency to resource | None                           | 0..∞                         | 1 dependency relation for each script
------------+------------------------+--------------------------------+------------------------------+-----------------------------------------
resource    | general attribute      | Max scope: private             | 0..∞                         | 0 attributes for glossary DB
            |                        |                                |                              | 3 private attributes for glossary entry
            | general operation      | Max scope: public              | 0..∞                         | 0 public operations for each resource
            | relation to resource   | Type: association              | 0..∞                         | 1 aggregation for glossary DB
            |                        |                                |                              | 0 relations for the other resources


Table 2 summarizes the physical constraints of the Web application domain (speci- 
fied in Figure 4) and shows that none of them is violated by the GLAP system. For 
each component or node, the domain constraints (the "Feature Constraints" column) 
and the relevant features in the application model (the "Actual Features" column) are 
listed side-by-side. 
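
The containment part of these physical constraints can be checked by counting contained elements per stereotype. The following Python sketch is illustrative only: the names MIN_CONTAINED, generalize, and check_container are hypothetical, only the "at least one" containment constraints of Figure 4 are encoded (dependency constraints are left out), and the GLAP contents follow Figure 6.

```python
from collections import Counter
from typing import List

# (container stereotype, contained stereotype) -> minimum count, read off Figure 4.
MIN_CONTAINED = {
    ("server", "resource component"): 1,
    ("server", "server page component"): 1,
    ("server page component", "server page"): 1,
    ("resource component", "resource"): 1,
}

# Specializations of resource in the domain model of Figure 3.
RESOURCE_KINDS = {"database", "business logic", "external system"}


def generalize(stereotype: str) -> str:
    """Map a specialized resource stereotype to its general stereotype."""
    return "resource" if stereotype in RESOURCE_KINDS else stereotype


def check_container(stereotype: str, contained: List[str]) -> List[str]:
    """Return the minimum-containment constraints violated by one container."""
    counts = Counter(generalize(s) for s in contained)
    return [
        f"«{stereotype}» needs at least {minimum} «{inner}», found {counts[inner]}"
        for (outer, inner), minimum in MIN_CONTAINED.items()
        if outer == stereotype and counts[inner] < minimum
    ]


# Contents of the GLAP system (stereotypes only).
glap_server = ["server page component", "server page component", "resource component"]
process_search_component = ["server page", "script", "script"]
edit_entry_component = ["server page", "script"]
glossary_db_component = ["database", "database"]
```

Each of the four GLAP containers passes the check, whereas a server that hosts no resource component would produce one violation message.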



4 Summary and Future Work 

The Application-based DOmain Modeling (ADOM) approach enables domain engi- 
neers to define structural and behavioral constraints that are applicable to all the sys- 
tems within a specific domain. When developing a system in ADOM, its domain (or 
domains) is first defined in order to enforce domain restrictions on particular systems. 
Then, the application models are validated against their domain models in order to 
detect semantic errors in early development stages. These errors cannot be found
automatically when relying on the syntax of the modeling language alone.

[Figure 6: the «server» node GLAP server hosts three components. The «server page
component» process search component contains «server page» process search, «script»
writeEntry, and «script» getEntries. The «server page component» edit entry contains
«server page» edit entry and «script» getEntry. The «resource component» glossary DB
component contains «database» glossary DB and «database» glossary entry.]

Fig. 6. A partial merged component and deployment diagram of the GLAP system - Allocating
the process search and edit entry server pages into components and nodes

Two major techniques are usually used when applying UML to the domain analy- 
sis area: stereotypes and metamodeling. The main limitation of the stereotypes tech- 
nique is the need to define the basic elements of a domain outside the model via a 
natural language, as was done, for example, by Conallen for the Web application 
domain [8]. While using natural languages is more comprehensible to humans, it 
lacks the needed formality for defining domain elements, constraints, and usage con- 
texts. The ADOM approach enables modeling the domain world in a (semi-) formal 
UML model. This model is used to validate domain-specific application models. 

While applying a metamodeling technique, the basic elements of the domain and 
the relations among them are modeled. Usually, the domain and application models 
are specified using different notions (and even different notations). In the case of 
UML, the domain models are expressed using class diagrams, while the application 
models are expressed using various UML diagram types. This unreasonably limits the 
expressiveness of domain models. In the ADOM approach, the domain and applica- 
tion models are specified using the same notation and ontology. In other words, the 
ADOM approach enables the specification of physical and behavioral constraints at the
domain level (layer). Furthermore, keeping the same notation and ontology for the
entire development team (which includes domain engineers and system engineers) 
improves collaboration during the development process. 

In this paper, we applied the ADOM approach to UML static views. In [22], we 
have also applied the ADOM approach to UML interaction diagrams. In the future, 
we plan to develop a domain validation tool that will check a system model against its 
domain model and will even guide system developers according to given domain 
models. An experiment is planned to classify domain-specific modeling errors when 
using the ADOM approach and other domain analysis methods. This experiment will 
also check the adoption of several different domains within the same application util- 
izing the ADOM approach. 

Table 2. The domain constraints and their fulfillment in the GLAP system model - comparing
the component and deployment diagrams

Component/Node | Feature Constraints | Actual Features
server | At least one node | One server called GLAP server
server | Includes at least one resource component | One resource component called glossary DB component
server | Includes at least one server page component | Two server page components called process search component and edit entry component
server | Includes 0 or more components of any type | No other components
server page component | Includes at least one server page class | The process search component includes one server page class (process search); the edit entry component includes one server page class (edit entry)
server page component | Includes 0 or more script classes | The process search component includes two script classes (writeEntry and getEntries); the edit entry component includes one script class (getEntry)
server page component | Depends on at least one resource component | The process search component depends on one resource component (glossary DB component); the edit entry component depends on one resource component (glossary DB component)
server page component | Depends on 0 or more server page components | The process search component depends on one server page component (edit entry component); the edit entry component does not depend on other server page components
server page component | Depends on 0 or more other components of any type | Neither the process search component nor the edit entry component depends on other components
resource component | Includes at least one resource class | The glossary DB component includes two resource classes of type database (glossary DB and glossary entry)
resource component | Includes 0 or more business logic classes | The glossary DB component includes 0 business logic classes
resource component | Includes 0 or more database classes | The glossary DB component includes two database classes (glossary DB and glossary entry)
resource component | Includes 0 or more external system classes | The glossary DB component includes 0 external system classes
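
As an illustration only, the multiplicity checks recorded in Table 2 can be sketched in Python. This is not the domain validation tool, which is mentioned above as future work; the Constraint record, the check function and the feature names used as dictionary keys are assumptions introduced for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Constraint:
    """A domain-level multiplicity constraint on one kind of feature (hypothetical record)."""
    feature: str                    # e.g. "server page class"
    minimum: int                    # 1 for "at least one", 0 for "0 or more"
    maximum: Optional[int] = None   # None means unbounded


def check(constraints, actual_counts):
    """Return the constraints that the actual feature counts violate."""
    violations = []
    for c in constraints:
        n = actual_counts.get(c.feature, 0)
        if n < c.minimum or (c.maximum is not None and n > c.maximum):
            violations.append(c)
    return violations


# Constraints on a server page component, as listed in Table 2.
SERVER_PAGE_COMPONENT = [
    Constraint("server page class", 1),
    Constraint("script class", 0),
    Constraint("resource component dependency", 1),
    Constraint("server page component dependency", 0),
]

# Actual features of the two GLAP server page components, as listed in Table 2.
process_search = {
    "server page class": 1,
    "script class": 2,
    "resource component dependency": 1,
    "server page component dependency": 1,
}
edit_entry = {
    "server page class": 1,
    "script class": 1,
    "resource component dependency": 1,
}
```

Calling check(SERVER_PAGE_COMPONENT, process_search) returns an empty list, which corresponds to a row of Table 2 whose constraints are all fulfilled.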

Abstract. In this paper we propose a methodological approach for the development
of XML databases. Our proposal is framed in MIDAS, a model driven
methodology for the development of Web Information Systems (WISs) based
on the Model Driven Architecture (MDA) proposed by the Object Management
Group (OMG). In this framework, the proposed data Platform Independent
Model (PIM) is the conceptual data model and the data Platform Specific
Model (PSM) is the XML Schema model. Both are represented in UML;
therefore, we also summarize in this work an extension to UML for XML
Schema. Moreover, we define the mappings to transform the data PIM into the
data PSM, which will be the XML database schema. The development process
of the XML database is illustrated by means of a case study: a WIS for the
management of medical images stored in Oracle's XML DB.

Keywords: XML Database Development, XML Schema, UML, Mappings,
Database Design, Model Driven Engineering.



1 Introduction 

The development of a database (DB) depends on several factors. On the one hand, it
depends on whether a DB already exists or has to be built from scratch. On the other
hand, we have to take into account the selected data repository, that is, whether we
want to use an object-relational (OR) DB or an XML DB. We have proposed a general
framework and a specific process for each of these cases when modeling a Web DB. In
[13] we defined a methodological approach for the development of OR DBs, including
the proposed tasks, models, notations and mapping rules to obtain the final
implementation of the OR DB in a product (Oracle); in this paper we examine the
development of XML DBs in depth. XML [3] is the current standard for information
exchange and data transportation between heterogeneous applications. Traditionally,
the information in XML documents was stored in conventional DB systems, but XML DBs
are now emerging as the best alternative to store and manage XML documents. There
exist different solutions for XML document storage, which can be roughly categorized,
according to [20], into two main groups: native XML DBs such as Tamino [17],
X-Hive/DB [21], eXcelon XIS [7], eXist [6] or ToX [2]; and XML DB extensions that
enable the storage of XML documents within conventional, usually relational or
object-relational, DataBase Management Systems (DBMSs), such as Oracle [16] (which
since version 9i includes new features collectively referred to as Oracle XML DB),
IBM DB2 XML Extender [8] or Microsoft SQLXML [14]. A study of different XML DB
solutions is presented in [20].

Nonetheless, good technology is not enough to support complex XML data and
applications. It is necessary to provide methodologies that guide designers in the
XML DB design task, in the same way as has traditionally been done with relational
or object-relational DBs [5]. There are a few works in this research line; for example,
[9] proposes some rules to obtain an XML Schema from a UML class diagram, but
unfortunately this proposal is not included in a methodological framework and does
not give specific guidelines for the design of XML DBs. So, although many XML DB
solutions exist, as mentioned above, to the best of our knowledge there is no
methodology for the systematic design of XML DBs.

For this reason, in this work we present a methodological approach for the
development of XML DBs in the framework of MIDAS, a model driven methodology for the
development of Web Information Systems (WISs). In our approach the proposed data
Platform Independent Model (PIM) is the conceptual data model and the data Platform
Specific Model (PSM) is the XML Schema model, both of which are defined in UML. For
this purpose, we summarize the extension to UML for representing XML Schemas, based
on the preliminary work presented in [18]. Moreover, we propose the mappings to
transform the data PIM into the data PSM in XML Schema. The obtained data PSM will
be the schema of the XML DB.

The rest of the paper is organized as follows: section 2 is an overview of the 
MIDAS framework, including its model driven architecture and its process; in section 
3 we focus on the XML DB development in MIDAS, showing the specific part of the 
process, the summarized UML extension and the mappings to obtain the data PSM 
from the data PIM; in section 4 we present part of a case study for the management of 
medical images. We focus on the development of the XML DB, showing the obtained
data PIM and data PSM, as well as a small part of the final implementation in
Oracle’s XML DB; finally, section 5 sums up the main conclusions as well as the future
work.



2 MIDAS Framework 

MIDAS [12] is a model driven methodology for the development of WISs with an
incremental and iterative process model based on prototyping. It also proposes some
techniques based on agile methodologies such as eXtreme Programming. Its
architecture is based on the Model Driven Architecture (MDA) proposed by the Object
Management Group (OMG) [15]. The methodology specifies some Computation
Independent Models, PIMs, PSMs and mappings between them. The MIDAS architecture
(see figure 1) considers the aspects of content, hypertext and behavior at the
PIM and PSM levels to model the system [4]. In this work we focus on the content
aspect, where the data PIM used is the conceptual data model and the PSM used is the
OR or the XML Schema model. The hypertext and behavior aspects are out of the
scope of this paper and are described in detail in other works [10].

MIDAS proposes to use standards in the development process, and, therefore, it is 
based on UML, XML and SQL:1999. UML is used to model the whole system in a
single notation. As UML provides some extension mechanisms, it can be extended to
be used for all the necessary techniques when modeling a WIS. MIDAS proposes to 
use some of the existing extensions for Web modeling and it also defines new ones 
whenever it is necessary, for example, the UML extension for OR DB design [11,13] 
or the UML extensions for XML technology, including one for XML Schemas [18]. 




Fig. 1. MIDAS Architecture

The MIDAS process defines different steps, and at the end of each step a new version of
the product is obtained. In the first step the user requirements and the architecture
have to be defined. In the second step, MIDAS/PR, a first PRototype of the WIS is
developed with static Web pages. This prototype permits validating the specified user
requirements with the customer and obtaining a first version of the product in a short
time. In the third step, MIDAS/ST, the STructural dimension of the WIS is carried
out. Taking the first version of the hypertext obtained in the previous step, a new
version of the Web hypertext is implemented using XML technology. The dynamic
Web pages in XML extract the information from the DB, which is also built in this
step. There is also another step, MIDAS/BH, to model the services and the BeHaviour
of the WIS, as well as another one for testing the system. Each step defines the
specific tasks, models and notations to be carried out.

This work deals specifically with the MIDAS/ST step of the MIDAS methodology, 
where the structural dimension of a WIS is carried out, which includes the content 
aspect. This aspect, on which we focus in this paper, corresponds to the traditional 
concept of a DB. Figure 2 shows the specific process for the development of the DB 
(XML or OR) and its specific tasks: at the PIM level the conceptual data design task 
has to be carried out and at the PSM level the logical data design task has to be
realized using XML or OR technology. A detailed description of the MIDAS process and
its tasks, models and notation can be found in [4,12,13]. 



3 XML Database Development 

In this work we concentrate our attention on the XML database development in the
MIDAS/ST step of the MIDAS methodology, which is responsible for the development
of the structural dimension of a WIS. This dimension includes the content aspect,
which corresponds to the traditional concept of a DB. The development of the
DB can be carried out in different ways: case a) a DB already exists in the
organization and we have to integrate it with the WIS; case b) we want to use an OR
DB; case c) we want to use an XML DB.



A Model Driven Approach for XML Database Development 783



Fig. 2. MIDAS Process for the DB development

In this section we show how to carry out the XML DB development starting
from scratch, including the necessary techniques and models. We first present the
specific development process for XML DBs, then we summarize the UML extension
to represent XML Schemas, based on the preliminary work presented in [18], and
finally we show the mappings to transform a data PIM into a data PSM defined
with an XML Schema.



3.1 XML Database Development Process 

The proposed process for the development of XML DBs includes several activities,
but we show only those specific to XML DB development: analysis, design
and implementation.

• The analysis activity is independent of the way in which the DB is developed and
of the technology used. The data PIM obtained in the previous step, MIDAS/PR, is
refined taking into account new user requirements and the feedback provided
by the use of the prototype obtained at the end of that step. The data PIM
is represented with a UML class diagram.

The tasks carried out in the design and implementation activities depend
mainly on the way we want to develop the DB; here we show only the activities
related to the development of an XML DB.

• In the design activity we have to obtain the logical data design. The logical design
of the DB is carried out starting from the data PIM obtained in the analysis activity
of the current step. From this data PIM we obtain the XML Schema at the PSM
level, represented in extended UML for XML Schemas (summarized in
section 3.2), by applying the mappings defined in section 3.3. The obtained XML
Schema is the logical design of the DB.

• The implementation activity includes the implementation of the DB, using an
XML DB to store the obtained XML Schemas.



3.2 Representing XML Schemas with Extended UML 

In MIDAS we propose to use the XML Schema model as the data PSM. An XML
Schema [19] is the definition of a specific XML structure. The W3C XML Schema
language specifies how each type of element is defined in the schema and the data
type associated with the element. The XML Schema itself is a special kind of
XML document that is written according to the rules given by the W3C XML Schema
specification. These rules constitute a language, known as the XML Schema definition
language; for a detailed description see [19].

The proposed notation to represent the XML Schema model is extended UML. So, 
next we sum up our UML extension to represent XML Schemas, which is based on 
the preliminary work presented in [18]. The extension defines a set of stereotypes, 
tagged values and constraints that enable us to represent in graphical notation in UML 
all the components of an XML Schema, keeping the associations, the specified order 
and nesting between them. 

According to the proposed UML extension, an XML schema is represented by
means of a UML package stereotyped with «Schema», which will include all the
components of the XML schema. The name of the schema will be the name of the 
package. The attributes of the XML Schema will be tagged values of the package. 

The XML elements are represented with stereotyped classes named as the value of 
the attribute name of the element. The attributes of the element will be tagged values 
of the class. The appearance order of the element in the XML Schema, including as a 
prefix the order number of the element to which it belongs, will also be a tagged value 
of the class and it will be represented next to the name of the class. 

The XML attributes are represented by means of UML attributes of the class that
represents the XML element to which the XML attributes belong. The base type of
an XML attribute will be represented as the data type of the corresponding UML 
attribute. The constraints to be satisfied by the attribute (required, optional) and the 
default or fixed value will be represented as tagged values. 

A compositor composition is a special kind of composition stereotyped with the 
kind of compositor: «Choice», «Sequence» or «All». It can only be used to
join an element (composite) with the elements that compose the father element 
(parts). The compositors can be used to represent nameless XML complexTypes. 

The XML complexTypes have been considered as stereotyped classes with 
«complexType», if they are named. In this case, the complexType will be related 
by means of a uses association with the element, complexType or simpleType that 
uses it. If the complexType has no name, it will be represented in an implicit way by 
the compositor that composes the complexType. 

The XML simpleType is a type that has no subelements or attributes. The simpleTypes
have been considered as classes stereotyped with «simpleType», each named as
the element that contains it. A simpleType will be related to its father element by
a composition stereotyped with «simpleType».

The XML complexContent is a subclass of the complexType that it defines. The
complexContent types have been considered as stereotyped classes, which must be
related by an inheritance association with the elements or complexTypes that the
complexContent redefines.

The XML simpleContent is a subclass of the complexType or a simpleType. The 
simpleContent types have been considered as stereotyped classes that are related with 
an inheritance association to the father type (simple or complex type) that is redefined 
by the simpleContent type. 

A uses association is a special kind of unidirectional stereotyped association 
which joins a named complexType with the element or type (simple or complex) that 
uses it. A «uses» association can also be used to join two elements by means of a 
ref attribute in one of the elements. The direction of the association is represented by
an arrow at the end of the element, which is used by the one that contains the
corresponding ref element.

A REF element will be represented by means of an attribute stereotyped with
«REF» and represents a link to another element. A REF attribute can only refer to
a defined element and is associated with the referred element by means of a uses
association.

In figure 3 the metamodel of the UML extension for XML Schemas is shown. 




Fig. 3. Metamodel of the UML extension for XML Schemas 



3.3 Mappings to Obtain the Data PSM from the Data PIM 

In this section we describe the mappings defined to build the data PSM
from the data PIM. There exist some other works [9] in which rules are defined
to obtain XML Schemas from the UML class diagram but, to our knowledge, none of
these proposals gives specific guidelines for the design of XML DBs.

We will start from the data PIM represented with a UML class diagram and will
obtain the data PSM in XML, also represented in extended UML, by applying the
following mapping rules:

• The complete conceptual data model is transformed, at the PSM level, into an
XML schema named ‘Data PSM’, including all the components of the data PIM.
It will be represented with a UML package stereotyped with «SCHEMA» and
named ‘Data PSM’. This package will include the components of the data PSM.

• Transformation of UML classes 

We can split the UML classes into different groups: subclasses of a generaliza- 
tion, classes that represent parts of a composition and finally, the rest of the 
classes. 

o The first and second groups of UML classes (subclasses and parts) will be 
transformed by means of named complexTypes. 
o In the third group the classes will be transformed into an element, named as the 
class name. The abstract classes are mapped into abstract elements. 

The complexTypes generated when transforming UML classes belonging to the 
first and second group, will be represented in extended UML by means of stereo- 
typed classes with «complexType» and named with the name of the subclasses 
or parts plus “_type”. The elements generated when transforming the UML classes 
belonging to the third group will be represented in extended UML with a class 
stereotyped with «ELEMENT» and named as the class of the data PIM from 
which it comes. 
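
The class transformation rule above can be sketched as a small Python function. This is a minimal illustration under stated assumptions, not part of MIDAS: the PimClass record, its boolean flags and the dictionary returned by map_class are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class PimClass:
    """A class of the data PIM, reduced to what the rule needs (hypothetical record)."""
    name: str
    is_subclass: bool = False   # subclass in a generalization (first group)
    is_part: bool = False       # part in a composition (second group)
    is_abstract: bool = False


def map_class(cls: PimClass) -> dict:
    """Apply the class transformation rule to one data PIM class."""
    if cls.is_subclass or cls.is_part:
        # First and second groups: a named complexType, suffixed with "_type".
        return {
            "construct": "complexType",
            "stereotype": "«complexType»",
            "name": cls.name + "_type",
        }
    # Third group: an element named as the class; abstract classes
    # are mapped into abstract elements.
    return {
        "construct": "element",
        "stereotype": "«ELEMENT»",
        "name": cls.name,
        "abstract": cls.is_abstract,
    }
```

For example, map_class(PimClass("Interpretation", is_part=True)) yields a complexType named Interpretation_type, whereas map_class(PimClass("Result")) yields an element named Result.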

• Transformation of UML attributes 

In order to map the UML attributes of the classes, we can transform them in two
different ways: into XML attributes or into XML elements. A straightforward
mapping is to transform the UML attributes into XML attributes of the element
that represents the class. However, this mapping cannot be used for UML attributes
that are not single valued, for example multivalued or composed ones. Moreover,
attributes are usually used to describe the content of the elements and not to be
visualized. For these reasons, and as the UML attributes are represented as classes
in the UML metamodel, we propose to transform the UML attributes of a class by
means of a complexType including as subelements the UML attributes of the
class.

This complexType will be represented in extended UML with a composition stereotyped
with «sequence», which includes all the attributes as subelements represented
with classes stereotyped with «ELEMENT».

The attributes of the class can be transformed according to their type:

o A mandatory attribute will be represented with a minimum multiplicity of one
at the composition, whereas an optional attribute will be represented with a
minimum multiplicity of zero at the composition.
o A multivalued attribute will be represented with a maximum multiplicity of N
in the part side of the composition.
o A composed attribute will be represented by an element that is related to the
composing attributes by means of a complexType.
o An enumerated attribute will be represented by a simpleType composition,
with the stereotyped restriction Enumeration.
o A choice attribute will be represented by a simpleType composition, with the
stereotyped restriction Choice.
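
The attribute rules for mandatory, optional and multivalued attributes can be sketched with Python's standard xml.etree.ElementTree module. The function names, the default type xs:string and the attribute name Keyword are assumptions made for this illustration; the multiplicity N is written as maxOccurs="unbounded", which is how XML Schema spells an unlimited maximum.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)


def xs(tag):
    """Qualify a tag name with the XML Schema namespace."""
    return "{" + XS + "}" + tag


def attribute_to_element(name, xsd_type="xs:string", mandatory=True, multivalued=False):
    """Map one UML attribute to an xs:element subelement (hypothetical helper)."""
    el = ET.Element(xs("element"), {"name": name, "type": xsd_type})
    if not mandatory:
        el.set("minOccurs", "0")          # the default value 1 is omitted
    if multivalued:
        el.set("maxOccurs", "unbounded")  # maximum multiplicity N
    return el


def class_to_sequence(attributes):
    """Wrap the mapped attributes of a class in a complexType with a sequence."""
    ctype = ET.Element(xs("complexType"))
    seq = ET.SubElement(ctype, xs("sequence"))
    for spec in attributes:
        seq.append(attribute_to_element(**spec))
    return ctype


# Impressions is taken from the case study; Keyword is an invented attribute.
example = class_to_sequence([
    {"name": "Impressions"},
    {"name": "Keyword", "mandatory": False, "multivalued": True},
])
```

Serializing the result with ET.tostring(example, encoding="unicode") gives the corresponding XML Schema fragment. Composed, enumerated and choice attributes are not covered by this sketch.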

• Transformation of associations

There exist two main aspects to take into account when transforming UML asso- 
ciations into an XML Schema. The first one deals with the direction in which the 
associations are implemented, that is, if they are unidirectional or bidirectional. 
The second one deals with the way in which the associations are mapped using 
XML Schema constructions. 

With regard to the first aspect, UML associations can be represented in an XML
schema either as unidirectional or as bidirectional associations. A unidirectional
association can be crossed in only one direction, whereas a bidirectional
one can be crossed in both directions. If we know that queries require
data in both directions of the association, then it is recommended to implement
it as a bidirectional association, improving in this way the response times. However,
we have to take into account that bidirectional associations are not maintained
by the system, so consistency has to be guaranteed manually.
Therefore bidirectional associations, despite improving the response times in some
cases, have a higher maintenance cost. The navigability (if it is represented)
in a UML diagram shows the direction in which the association should be
implemented.

With regard to the second aspect, the way in which an association is
mapped to XML Schema is a crucial issue, and there are different ways of
transforming UML associations into XML Schema associations within an XML document,
each with its advantages and disadvantages. Some criteria for selecting the best
alternative are related to the kind of information, the desirable level of
redundancy, etc. In [9] a study of the different alternatives is made. We propose to
model the associations by adding association elements within the XML elements
that represent the classes implicated in the association. Next, we show how to map
the association in a unidirectional way using ref elements.

o One-to-One. A one-to-one association will be mapped creating an association 
subelement of one of the elements that represent the classes implicated in the 
association. The subelement will be named with the association name. This 
subelement will include a complexType with an element of a ref type that ref- 
erences the other element implicated in the association. If the minimum multi- 
plicity is one, the attribute minOccurs will be one, which is the default value 
and can be omitted, and otherwise it will be zero. As the maximum multiplicity 
is one, the attribute maxOccurs has to be one, which is the default value too 
and can be omitted. 

o One-to-Many. A one-to-many association will be transformed in a unidirec- 
tional way creating an association subelement within the element that repre- 
sents the class with multiplicity N, named as the association, including a com- 
plexType within this element with a subelement of a ref type that references 
the other element implicated in the association. If the minimum multiplicity is 
one, the attribute minOccurs will be one, which is the default value and can be 
omitted, and otherwise it will be zero. As the maximum multiplicity is one in 
this direction, the attribute maxOccurs has to be one, which is the default value 
and can also be omitted. 

o Many-to-Many. Following the same reasoning as in the previous case, a
many-to-many association will be transformed defining an association element
within one of the elements, including a «sequence» complexType of reference
elements to the collection of elements implicated in the association. If the
minimum multiplicity is one, the attribute minOccurs will be one, which is the
default value, otherwise it will be zero. As the maximum multiplicity is N in
this direction, the attribute maxOccurs has to be N.
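
The three association cases can be sketched in Python as a single helper that builds the association subelement. Several choices here are assumptions rather than rules stated in the text: the helper name association_element, placing minOccurs and maxOccurs on the ref element, wrapping a single reference in xs:all (as the code in figure 6 appears to do) and a collection of references in xs:sequence, and writing N as "unbounded".

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)


def xs(tag):
    """Qualify a tag name with the XML Schema namespace."""
    return "{" + XS + "}" + tag


def association_element(name, target, min_mult=1, max_mult=1):
    """Build the association subelement to be nested in the source element.

    name     -- name of the association (becomes the subelement name)
    target   -- name of the referenced element
    min_mult -- minimum multiplicity of the target side (0 or 1)
    max_mult -- maximum multiplicity of the target side (1 or "N")
    """
    assoc = ET.Element(xs("element"), {"name": name})
    ctype = ET.SubElement(assoc, xs("complexType"))
    # Assumption: xs:all for a single reference, xs:sequence for a collection.
    group = ET.SubElement(ctype, xs("sequence" if max_mult == "N" else "all"))
    ref = ET.SubElement(group, xs("element"), {"ref": target})
    if min_mult == 0:
        ref.set("minOccurs", "0")          # default 1 is omitted
    if max_mult == "N":
        ref.set("maxOccurs", "unbounded")  # default 1 is omitted otherwise
    return assoc


# One-to-many direction from the case study: Result refers to one Study.
originate = association_element("originate", "Study")
```

The returned element is meant to be appended to the sequence of the element that represents the source class, for example the element that represents the class with multiplicity N in the one-to-many case.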

• Transformation of aggregations 

An aggregation will be mapped creating a subelement of the element which represents
the aggregate, named as the aggregation. If the aggregation has no name, the
name will be “is_aggregated_of”. This element (aggregate) will include a complexType
with an element of ref type that references the parts of the aggregation.
If the maximum multiplicity is N, the complexType will include a sequence of
references. If the minimum multiplicity is one, the attribute minOccurs will be
one, which is the default value, otherwise it will be zero.

It will be represented in extended UML by means of an aggregation stereotyped 
with «all». 

• Transformation of compositions 

A composition of classes in the UML class diagram will be mapped including as 
subelements in the element that represents the compositor the parts of the compo- 
sition. The subelements will be of the type of the complexType defined to repre- 
sent the parts of the composition. 

It will be represented in extended UML including in the compositor a part of the 
stereotyped composition and a uses association, which relates the part with the 
corresponding complexType of the part. 

• Transformation of generalizations 

A generalization of classes in the UML class diagram will be mapped including 
the superclass as an element and a complexType of choice type which includes as 
subelements of a complexType the subclasses of the generalization. 

It will be represented in extended UML by means of a composition stereotyped 
with «choice». 
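
The generalization rule can be sketched in Python as follows. This is a simplified illustration: the function name generalization_element is an assumption, and the sketch places the choice directly inside the complexType of the superclass element, whereas a complete schema would combine it with the other subelements of the superclass.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)


def xs(tag):
    """Qualify a tag name with the XML Schema namespace."""
    return "{" + XS + "}" + tag


def generalization_element(superclass, subclasses, abstract=False):
    """Map a generalization to an element with a choice of subclass elements."""
    el = ET.Element(xs("element"), {"name": superclass})
    if abstract:
        el.set("abstract", "true")
    ctype = ET.SubElement(el, xs("complexType"))
    choice = ET.SubElement(ctype, xs("choice"))
    for sub in subclasses:
        # Each subclass was mapped to a complexType named <subclass>_type.
        ET.SubElement(choice, xs("element"), {"name": sub, "type": sub + "_type"})
    return el


series = generalization_element("Series", ["Image", "Raw_data"], abstract=True)
```

Each alternative of the choice is typed with the named complexType produced by the class transformation rule for the corresponding subclass.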

The proposed mapping rules to obtain the data PSM are summarized in table 1.



4 A Case Study 

The tasks, models and mappings proposed in MIDAS are being defined by means of
different case studies. In this paper we present part of the case study of a WIS for
medical image management. This WIS is based on DICOM (Digital Imaging and
Communications in Medicine) [1], which is the most widely accepted standard for
medical image exchange. In this paper, the presented case study focuses only on the
development of the XML DB. We show how to build it starting from a conceptual
data model. Section 4.1 presents the data PIM and section 4.2 presents
the data PSM, showing how to apply the proposed mappings to obtain it. Finally,
section 4.3 shows the XML database implementation in Oracle’s XML DB.



4.1 Data PIM 

For the sake of brevity we will only present a reduced part of the data PIM obtained in
the analysis activity of the MIDAS/ST step. This partial data PIM is based on the
information model defined in the DICOM standard. As we can see in figure 4, Patients
can make one or more Visits. Each Visit can give rise to one or more Studies. A Study
is formed by several Study Components, which can belong simultaneously to different
Studies. A Study Component makes references to several Series, each of which is a set
of Images. There are different kinds of Series like Image, Raw Data, etc. A Result is
obtained from a Study and is composed of several Interpretations.



Table 1. Mapping rules to pass from the data PIM into the data PSM.

Data PIM                      Data PSM
Data PIM                      XML Schema
Class                         XML Element
Sub class (generalization)    Subelement of complexType
Part class (composition)      Subelement of complexType
Attribute                     Subelement
  Mandatory                   minOccurs = 1 (default)
  Optional                    minOccurs = 0
  Multivalued                 maxOccurs = N
  Composed                    (all | sequence) complexType
  Choice                      choice complexType
Association
  One-To-One                  Subelement (of any element) for association with
                              complexType including a REF element
  One-To-Many                 Subelement (multiplicity N) for association with
                              complexType including a REF to element (multiplicity 1)
  Many-To-Many                Subelement (of any element) for association with
                              sequence complexType of REFs to elements
Aggregation
  Maximum Multiplicity: 1     Subelement with complexType with REF element
  Maximum Multiplicity: N     Subelement with sequence complexType of REF elements
Composition                   ALL composition relating the compositor with the parts
Generalization                Choice complexType




Fig. 4. Partial Data PIM in UML 



4.2 Data PSM

To obtain the data PSM from the data PIM we have to apply the mappings defined in
section 3.3 and use the extended UML notation summarized in section 3.2 to
represent the resulting XML Schema.

Transformation of classes: In order to transform the UML classes we can split
them into two groups as follows:

• The first group is formed by those classes that are part classes in a composition,
such as the Interpretation class, as well as those classes that are subclasses in a
generalization, such as the Image and Raw Data classes. These classes are transformed
into XML complexTypes and the attributes of these classes are mapped into
subelements of the defined complexTypes. Figure 5 depicts part of the data PSM
represented in extended UML. The part highlighted with a solid line shows the
transformation of the Interpretation class by means of a complexType named
Interpretation_type.

• The other group is formed by the rest of the classes. Each of these classes is
mapped into an XML element and its attributes are mapped into subelements of
the XML element that represents the class, related by means of a composition
stereotyped with «sequence». Figure 5 shows the transformation of the class
Result, which is highlighted with a dashed line.
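
The split into the two groups can be reproduced mechanically from the relationships of the partial data PIM. The following Python sketch is an illustration only: the class names and relationships are read off the prose description of figure 4 (singular forms such as Patient and Visit are assumed), and the function name split_classes is hypothetical.

```python
# Relationships taken from the description of the partial data PIM.
COMPOSITIONS = [("Result", "Interpretation")]          # (whole, part)
GENERALIZATIONS = [("Series", "Image"),                # (superclass, subclass)
                   ("Series", "Raw Data")]
CLASSES = ["Patient", "Visit", "Study", "Study Component", "Series",
           "Image", "Raw Data", "Result", "Interpretation"]


def split_classes(classes, compositions, generalizations):
    """Split the PIM classes into those mapped to complexTypes and to elements."""
    parts = {part for _, part in compositions}
    subclasses = {sub for _, sub in generalizations}
    to_complex_types = [c for c in classes if c in parts or c in subclasses]
    to_elements = [c for c in classes if c not in parts and c not in subclasses]
    return to_complex_types, to_elements


complex_types, elements = split_classes(CLASSES, COMPOSITIONS, GENERALIZATIONS)
```

With the relationships listed above, Image, Raw Data and Interpretation fall into the first group and the remaining classes, including Result and Series, into the second.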




Fig. 5. Partial Data PSM in extended UML 



Transformation of associations: In figure 5 the transformation of the one-to-many
association between the Study and Result classes is highlighted with a dotted line. For
space reasons the Study class is not completely represented in this figure and the
representation of its UML attributes was omitted. As the mapping rules indicate, the
one-to-many association between these classes is transformed by adding a subelement to
the element that represents the class with the maximum multiplicity N. In this case, we
add the subelement originate, named as the association, to the element Result; it
has a reference to the element that represents the class with the maximum multiplicity
one. The subelement originate will be related to the Study element by means of a uses
association. Moreover, figure 5 depicts the transformation of the composition between
the Result and Interpretation classes. This composition is mapped by adding a subelement
to the element that represents the compositor class Result, named as the part class
Interpretation. The type of the Interpretation element is the complex type
Interpretation_type defined when mapping the Interpretation class.

Additionally, figure 6 shows the XML Schema code generated from the UML diagram
depicted in figure 5.



<xs:element name="Result">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="Impressions" type="xs:string"/>
      <xs:element name="Comments" type="xs:string"/>
      <xs:element name="originate">
        <xs:complexType>
          <xs:all>
            <xs:element ref="Study"/>
          </xs:all>
        </xs:complexType>
      </xs:element>
      <xs:element name="Interpretation" type="Interpretation_type"
                  minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

<xs:complexType name="Interpretation_type">
  <xs:sequence>
    <xs:element name="Inter_type" type="xs:string"/>
    <xs:element name="Inter_diagnostic" type="xs:string"/>
    <xs:element name="Inter_author" type="xs:string"/>
    <xs:element name="Inter_text" type="xs:string"/>
  </xs:sequence>
</xs:complexType>



Fig. 6. XML Schema code 



In order to transform the disjoint and incomplete generalization association 
between the Series, Image and Raw Data classes, an element for each subclass has to be 
added to the XML choice complexType, within the element that represents the 
superclass. That is to say, the Image and Raw_data elements have to be included in 
the Series element. The complexTypes of the added elements are Image_type and 
Raw_data_type, respectively. These types were created when transforming the Image 
and Raw Data UML classes. Figure 7 shows the resulting transformation in extended 
UML and figure 8 shows the corresponding generated XML Schema code. In both 
figures, the subelements of Image_type and Raw_data_type were omitted for space 
reasons. 




Fig. 7. Partial Data PSM in extended UML 
(generalization transformation) 



<xs:element name="Series" abstract="true">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="reference">
        <xs:complexType>
          <xs:all>
            <xs:element ref="Study_Component"/>
          </xs:all>
        </xs:complexType>
      </xs:element>
      <xs:choice>
        <xs:element name="Image" type="Image_type"/>
        <xs:element name="Raw_data" type="Raw_data_type"/>
      </xs:choice>
    </xs:sequence>
  </xs:complexType>
</xs:element>



Fig. 8. XML Schema code 



4.3 Database Implementation in Oracle XML DB 

The XML Schema obtained in the previous section was implemented using Oracle’s 
XML DB. Based on the study of different XML DB solutions made in [20] and on the 
previous experience of our research group, we chose Oracle to validate our 
proposal and to carry out the implementation of the XML DB. However, since we propose 
the standard XML Schema as the data storage model in XML, the approach is 
applicable to any DBMS that supports XML Schema. Figure 9 shows how Oracle stores 
the XML data compliant with the defined XML Schema. We use 
the UML extension for OR DB design proposed in [13] to represent it. For space 
reasons, figure 9 only shows the part corresponding to the XML Schema 
depicted in figure 5. 



[Figure 9: UML diagram of the object-relational storage structures. Recoverable elements: 
  «NT» Originate_65_T: «REF» Study_Type 
  «Object_type» Result_type: Impressions, Comments, Originate: Originate65_T, 
      «ARRAY» Interpretation: Interpretation_63_ 
  «UDT»: Inter_type, Inter_diagnosis, Inter_author, Inter_text 
  «table» 
  «Object_type» Study_Type: Reason_for_Study, Requesting_physician, Study_time, 
      Study_date, has: has66_T, derive_in: derive_in68_T, formed_by: formed_by70_] 



Fig. 9. Implementation in Oracle’s XML DB 



5 Conclusions and Future Work 

Nowadays there exist different solutions for the storage of XML data but, despite 
several works in this line, there is no methodology for the systematic design 
of XML databases. In this paper we have described a model driven approach for the 
development of XML DBs in the framework of MIDAS, a model driven methodology 
for the development of WIS based on MDA. Specifically, we have focused on the 
content aspect of the structural dimension of MIDAS, which corresponds to the 
traditional concept of a DB. There exist different ways of developing a DB. In this paper 
we have proposed the development process for XML DBs, where the data PIM is the 
conceptual data model (UML class diagram) and the data PSM is the XML Schema 
model. Both of them are represented in UML; therefore we have also summarized 
the UML extension to represent XML Schemas. Moreover, we have defined the 
mappings to transform the data PIM into the data PSM, which will be the XML database 
schema. 

We have developed different case studies to validate our proposal, and in this paper 
we have shown part of the case study of the development of an XML DB for the 
management of medical images stored in Oracle’s XML DB. 

We are working on the implementation of a CASE tool (MIDAS-CASE), which 
integrates all the techniques proposed in MIDAS for the semiautomatic generation of 
WIS. The repository of the CASE tool is also being implemented in Oracle’s XML 
DB, following the approach proposed in this paper. We have already implemented 
the XML module of the tool, including the part for XML Schema and WSDL. We are now 
implementing, on the one hand, the automatic generation of the XML Schema code 
from the corresponding graphical representation of the data PSM in extended UML 
and, on the other hand, the semi-automatic transformation from the data PIM to the 
data PSM to obtain the code of the XML DB. 

Abstract. Updates over virtual XML views that wrap the relational 
data have not been well supported by current XML data management 
systems. This paper studies the problem of the existence of a correct 
relational update translation for a given view update. First, we propose 
a clean extended-source theory to decide whether a translation mapping 
is correct. Then, to answer the question of the existence of a correct 
mapping, we classify a view update as either un-translatable, conditionally or 
unconditionally translatable under a given update translation policy. We 
design a graph-based algorithm to classify a given update into one of 
the three update categories based on schema knowledge extracted from 
the XML view and the relational base. This represents a practical 
approach that could be applied by any existing view update system 
in industry and in academia for analyzing the translatability of a given 
update statement before its translation is attempted. 



1 Introduction 

Typical XML management systems [5, 9, 14] support the creation of XML wrap- 
ping views and the querying against these virtual views to bridge the gap be- 
tween relational databases and XML applications. Update operations against 
such wrapper views, however, are not well supported yet. 

The problem of updating XML views published over relational data comes 
with new challenges beyond those of updating relational [1, 7] or even object- 
oriented [3] views. The first is the updatability. That is, the mismatch between 
the hierarchical XML view model and the flat relational base model raises the 
question whether the given view update is even mappable into SQL updates. The 
second is the translation strategy. That is, assuming the view update is indeed 
translatable, how to translate the XQuery update statements on the XML view 
into the equivalent tuple-based SQL updates expressed on the relational base. 

Translation strategies have been explored to some degree in recent work. [11] 
presents an XQuery update grammar and studies the execution performance of 
translated updates. However, the assumption made in this work is that the given 
update is indeed translatable and that in fact it has already been translated into 
SQL updates over a relational database, which is assumed to be created by a 
fixed inline loading strategy [8]. Commercial database systems such as 
SQL-Server2000 [10], Oracle [2] and DB2 [6] also provide system-specific solutions 
for restricted update types, again under the assumption of given updates always 
being translatable. 

Our earlier work [12] studies the XML view updatability for the “round-trip” 
case, which is characterized by a pair of invertible lossless mappings for 
(1) loading the XML documents into the relational bases, and (2) extracting an 
XML view identical to the original XML document back out of it. We prove 
that such XML views are always updatable by any update operation valid on 
the XML view. However, to the best of our knowledge, no result in the literature 
focuses on a general method to assess the updatability of an arbitrary XML view 
published over an existing relational database. 

This view updatability issue has been a long standing difficult problem even 
in the relational context. Using the concept of “clean source”, Dayal and Bern- 
stein [7] characterize the schema conditions under which a relational view over 
a single table is updatable. Beyond this result, our current work now analyzes 
the key factors affecting the view updatability in the XML context. That is, 
given an update translation policy, we classify updates over an XML view as 
un-translatable, conditionally or unconditionally translatable. As we will show, 
this classification depends on several features of the XML view and the update 
statements, including: (a) granularity of the update at the view side, (b) 
properties of the view construction, and (c) types of duplication appearing in the 
view. By extending the concept of a “clean source” for relational databases [7] 
into “clean extended-source” for XML, we now propose a theory for determining 
the existence of a correct relational update translation for a given XML view 
update. 

We also provide a graph-based algorithm to identify the conditions under 
which an XML view over a relational database is updatable. The algorithm de- 
pends only on the view and database schema knowledge instead of on the actual 
database content. It rejects un-translatable updates, requests additional condi- 
tions for conditionally translatable updates, and passes unconditionally trans- 
latable updates to the later update translation step. The proof of correctness 
of our algorithm can be found in our technical report [13]. It utilizes our clean 
extended-source theory. 

Section 2 analyzes the factors deciding the XML view updatability, which is 
then formalized in Section 3. In Section 4 we propose the “clean extended-source” 
theory as theoretical foundation of our proposed solution. Section 5 describes our 
graph-based algorithm for detecting update translatability. Section 6 provides 
conclusions. 



2 Factors for XML View Updatability 

Using examples, we now illustrate what factors affect the view updatability in 
general, and which features of XML specifically cause new view update transla- 
tion issues. Recent XML systems [5, 9, 14] use a default XML view to define the 
book
  bookid (primary key)   title
  98001                  TCP/IP Illustrated
  98002                  Programming in Unix
  98003                  Data on the Web

price
  bookid (primary key)   amount   website (primary key)
  98001                  63.70    www.amazon.com
  98003                  56.00    www.amazon.com
  98003                  45.60    www.bookpool.com

CREATE TABLE book(
  bookid VARCHAR2(20),
  title VARCHAR2(100),
  CONSTRAINT BookPK PRIMARY KEY (bookid))

CREATE TABLE price(
  bookid VARCHAR2(20),
  amount DOUBLE,
  website VARCHAR2(100),
  CONSTRAINT PricePK PRIMARY KEY (bookid, website),
  FOREIGN KEY (bookid) REFERENCES book (bookid))

Fig. 1. Relational database 



<DB>
  <book>
    <row>
      <bookid>98001</bookid>
      <title>TCP/IP Illustrated</title>
    </row> ...
  </book>
  <price>
    <row>
      <bookid>98001</bookid>
      <amount>63.70</amount>
      <website>www.amazon.com</website>
    </row> ...
  </price>
</DB>

Fig. 2. Default XML view of database shown in Figure 1 



one-to-one XML-to-relational mapping (Fig. 2). A view query (Fig. 3) is defined 
over it to express user-specific XML wrapper views. User updates over the vir- 
tual XML views are expressed in XQuery update syntax [11] (Fig. 4). Also, we 
only consider insertion/deletion in our discussion. A replacement is treated as a 
deletion followed by an insertion and is not discussed separately. 

2.1 Update Translation Policy 

Clearly, the update translation policy chosen for the system is essential for the 
decision of view updatability. An update may be translatable under one policy, 
while not under another one. We now enumerate common policies observed in 
the literature [3,11] and in practice [14]. 

Policies for update type selection. (1) Same type. The translated update al- 
ways must have the same update type as the given view update. (2) Mixed type. 
Translated updates with a different type are allowed. 

Policies for maintaining referential integrity of the relational database 
under deletion. (1) Cascade. The directly translated relational updates cascade to 
update the referenced relations as well. (2) Restrict. The relational update is 
restricted to the case when there are no referenced relations. Otherwise, the 
view update is rejected. (3) Set Null. The relational update is performed as required, 
while the foreign key is set to be NULL in each dangling tuple. 
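
The three deletion policies can be tried out on the example database of Fig. 1. The following is a minimal sketch, not part of the original proposal: it assumes SQLite (through Python's sqlite3 module) in place of the Oracle-style schema of Fig. 1, and the helper name delete_first_book is hypothetical. The composite primary key of price is left out on purpose so that Set Null can null the bookid column.

```python
import sqlite3

def delete_first_book(on_delete):
    """Delete book 98001 under an ON DELETE policy; return the price rows left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys unless enabled
    conn.execute("CREATE TABLE book(bookid TEXT PRIMARY KEY, title TEXT)")
    # Assumption: no primary key on price, so SET NULL may null the bookid column.
    conn.execute(
        "CREATE TABLE price(bookid TEXT, amount REAL, website TEXT, "
        "FOREIGN KEY (bookid) REFERENCES book(bookid) ON DELETE " + on_delete + ")"
    )
    conn.executemany("INSERT INTO book VALUES (?, ?)", [
        ("98001", "TCP/IP Illustrated"),
        ("98002", "Programming in Unix"),
        ("98003", "Data on the Web"),
    ])
    conn.executemany("INSERT INTO price VALUES (?, ?, ?)", [
        ("98001", 63.70, "www.amazon.com"),
        ("98003", 56.00, "www.amazon.com"),
        ("98003", 45.60, "www.bookpool.com"),
    ])
    try:
        conn.execute("DELETE FROM book WHERE bookid = '98001'")
    except sqlite3.IntegrityError:
        return "rejected"  # Restrict: a price row still references the book
    return conn.execute(
        "SELECT bookid, amount, website FROM price ORDER BY rowid"
    ).fetchall()

for policy in ("CASCADE", "RESTRICT", "SET NULL"):
    print(policy, delete_first_book(policy))
```

Under Cascade the price row of book 98001 disappears together with the book, under Restrict the deletion is rejected, and under Set Null the price row survives with a NULL bookid.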

The translatability of a valid view update under a given policy can be classified 
as unconditionally translatable, conditionally translatable and un-translatable. A 
view update is called un-translatable if it cannot be mapped into relational up- 
dates without violating some consistency. A view update is unconditionally trans- 
latable if such a translation always exists under the given policy. Otherwise, we 
call it conditionally translatable. That is, under the current update policy, the 
given update is not translatable unless additional conditions, such as assump- 
tions or user communication, are introduced to make it translatable. 

When not stated otherwise, throughout the paper we pick the most commonly 
used policy, that is, same update type and delete cascade. If a different translation 
Q1
<bib>
FOR $book IN document("default.xml")/book/row
RETURN {
  <book_info>
    $book/bookid,
    $book/title,
    FOR $price IN document("default.xml")/price/row
    WHERE $book/bookid = $price/bookid
    RETURN {
      <price_info>
        $price/amount,
        $price/website
      </price_info> }
  </book_info>
</bib> }

V1
<bib>
  <book_info>
    <bookid>98001</bookid>
    <title>TCP/IP Illustrated</title>
    <price_info>
      <amount>63.70</amount>
      <website>www.amazon.com</website>
    </price_info>
  </book_info>
  <book_info>
    <bookid>98003</bookid>
    <title>Data on the Web</title>
    <price_info>
      <amount>56.00</amount>
      <website>www.amazon.com</website>
    </price_info>
    <price_info>
      <amount>45.60</amount>
      <website>www.bookpool.com</website>
    </price_info>
  </book_info>
</bib>

(a) View V1 defined by Q1 



Q2
<bib>
FOR $book IN document("default.xml")/book/row,
    $price IN document("default.xml")/price/row
WHERE $book/bookid = $price/bookid
RETURN {
  <price_info>
    $price/amount,
    $price/website,
    <book_info>
      $book/bookid,
      $book/title
    </book_info>
  </price_info>
</bib> }

V2
<bib>
  <price_info>
    <amount>63.70</amount>
    <website>www.amazon.com</website>
    <book_info>
      <bookid>98001</bookid>
      <title>TCP/IP Illustrated</title>
    </book_info>
  </price_info>
  <price_info>
    <amount>56.00</amount>
    <website>www.amazon.com</website>
    <book_info>
      <bookid>98003</bookid>
      <title>Data on the Web</title>
    </book_info>
  </price_info>
  <price_info>
    <amount>45.60</amount>
    <website>www.bookpool.com</website>
    <book_info>
      <bookid>98003</bookid>
      <title>Data on the Web</title>
    </book_info>
  </price_info>
</bib>

(b) View V2 defined by Q2 



Q3
<bib>
FOR $book IN document("default.xml")/book/row,
    $price IN document("default.xml")/price/row
WHERE $book/bookid = $price/bookid
RETURN {
  <book_info>
    $book/bookid,
    $book/title,
    <price_info>
      $price/amount,
      $price/website
    </price_info>
  </book_info>
</bib> }

V3
<bib>
  <book_info>
    <bookid>98001</bookid>
    <title>TCP/IP Illustrated</title>
    <price_info>
      <amount>63.70</amount>
      <website>www.amazon.com</website>
    </price_info>
  </book_info>
  <book_info>
    <bookid>98003</bookid>
    <title>Data on the Web</title>
    <price_info>
      <amount>56.00</amount>
      <website>www.amazon.com</website>
    </price_info>
  </book_info>
  <book_info>
    <bookid>98003</bookid>
    <title>Data on the Web</title>
    <price_info>
      <amount>45.60</amount>
      <website>www.bookpool.com</website>
    </price_info>
  </book_info>
</bib>

(c) View V3 defined by Q3 



Q4
<bib>
FOR $book IN document("default.xml")/book/row
RETURN {
  <book_info>
    $book/bookid,
    $book/title,
    FOR $price IN document("default.xml")/price/row
    WHERE $book/bookid = $price/bookid
    RETURN {
      <price_info>
        $book/bookid,
        $price/amount,
        $price/website
      </price_info> }
  </book_info>
</bib> }

V4
<bib>
  <book_info>
    <bookid>98001</bookid>
    <title>TCP/IP Illustrated</title>
    <price_info>
      <bookid>98001</bookid>
      <amount>63.70</amount>
      <website>www.amazon.com</website>
    </price_info>
  </book_info>
  <book_info>
    <bookid>98003</bookid>
    <title>Data on the Web</title>
    <price_info>
      <bookid>98003</bookid>
      <amount>56.00</amount>
      <website>www.amazon.com</website>
    </price_info>
    <price_info>
      <bookid>98003</bookid>
      <amount>45.60</amount>
      <website>www.bookpool.com</website>
    </price_info>
  </book_info>
</bib>

(d) View V4 defined by Q4 



Fig. 3. Views V1 to V4 defined by XQuery Q1 to Q4 respectively 



policy is used, then the discussion can be easily adjusted accordingly. Also, we do 
not indicate the order of the translated relational updates. For a given execution 
strategy, the correct order can be easily decided [1, 11, 12]. 



2.2 New Challenges Arising from XML Data Model 

Example 1 (View construction consistency). Assume two view updates \(u_1^V\) and 
\(u_2^V\) (Fig. 4) delete a “book_info” element from V1 and V2 in Fig. 3 respectively. 
\(u_1^V\) (deletion of a book_info element from V1):
FOR $root IN document("V1.xml"),
    $book IN $root/book_info
WHERE $book/title/text() = "TCP/IP Illustrated"
UPDATE $root {
  DELETE $book }

\(u_2^V\) (deletion of a book_info element from V2):
FOR $root IN document("V2.xml"),
    $book IN $root/price_info/book_info
WHERE $book/title/text() = "TCP/IP Illustrated"
UPDATE $root {
  DELETE $book }

(deletion of a price_info element from V4):
FOR $root IN document("V4.xml"),
    $book IN $root/book_info
WHERE $book/title/text() = "Data on the Web"
  AND $book/price_info/website = "www.amazon.com"
UPDATE $root {
  DELETE $book/price_info }

(deletion of a price_info element from V2):
FOR $root IN document("V2.xml"),
    $price IN $root/price_info
WHERE $price/book_info/title/text() = "TCP/IP Illustrated"
UPDATE $root {
  DELETE $price }

\(u_3^V\) (deletion of a book_info element from V3):
FOR $root IN document("V3.xml"),
    $book IN $root/book_info
WHERE $book/title/text() = "Data on the Web"
  AND $book/price_info/website = "www.amazon.com"
UPDATE $root {
  DELETE $book }

(deletion of a price_info element from V3):
FOR $root IN document("V3.xml"),
    $book IN $root/book_info
WHERE $book/title/text() = "Data on the Web"
  AND $book/price_info/website = "www.amazon.com"
UPDATE $root {
  DELETE $book/price_info }

(deletion of a book_info element from V4):
FOR $root IN document("V4.xml"),
    $book IN $root/book_info
WHERE $book/title/text() = "Data on the Web"
  AND $book/price_info/website = "www.amazon.com"
UPDATE $root {
  DELETE $book }

(insertion of a book_info element into V3):
FOR $root IN document("V3.xml")
UPDATE $root {
  INSERT
  <book_info>
    <bookid>"98003"</bookid>
    <title>"Data on the Web"</title>
    <price_info>
      <amount>56.00</amount>
      <website>www.ebay.com</website>
    </price_info>
  </book_info> }



Fig. 4. Update operations on XML views defined in Fig. 3 



<bib>
  <book_info>
    <bookid>98003</bookid>
    <title>Data on the Web</title>
    <price_info>
      <amount>56.00</amount>
      <website>www.amazon.com</website>
    </price_info>
    <price_info>
      <amount>45.60</amount>
      <website>www.bookpool.com</website>
    </price_info>
  </book_info>
</bib>

(a) V1' 



\(u_1^R\): DELETE FROM book
       WHERE book.ROWID IN (
         SELECT DISTINCT book.ROWID FROM book
         WHERE (book.title = 'TCP/IP Illustrated') )

\(u_2^R\): DELETE FROM price
       WHERE price.ROWID IN (
         SELECT DISTINCT price.ROWID FROM book, price
         WHERE (book.title = 'TCP/IP Illustrated') AND
               (book.bookid = price.bookid) )

(b) \(U^R\) 



book
  bookid   title
  98002    Programming in Unix
  98003    Data on the Web

price
  bookid   amount   website
  98003    56.00    www.amazon.com
  98003    45.60    www.bookpool.com

(c) D' 



(d) Q1(D'). Same as (a). 



Fig. 5. Translate \(u_1^V\). (a) V1': The user expected updated view, (b) \(U^R\): The translated 
update, (c) D': The updated relational database, (d) Q1(D'): The regenerated view 



(i) Fig. 5 shows \(u_1^V\) is unconditionally translatable. The translated relational 
update sequence \(U^R\) in Fig. 5(b) will delete the first book from the “book” 
relation by \(u_1^R\), and its prices from the “price” relation through \(u_2^R\). By 
re-applying the view query Q1 on the updated database D' in Fig. 5(c), the updated 
XML view in Fig. 5(d) equals the user expected updated view V1' in Fig. 5(a). 

(ii) Fig. 6 shows \(u_2^V\) is un-translatable. First, the relational update \(u_1^R\) in Fig. 6(b) 
is generated to delete the book (bookid=98001) from the “book” relation. Note 
the foreign key from the “price” relation to the “book” relation (Fig. 1). The 
second update operation \(u_2^R\) will be generated by the update translator to keep 
the relational database consistent. The regenerated view in Fig. 6(d) is different 
from the user expected updated view V2' in Fig. 6(a). No other translation is 
available which could preserve consistency either. 

The existence of a correct translation is affected by the view construction 
consistency property, namely, whether the XML view hierarchy agrees with the 
hierarchical structure implied by the base relational schema. 
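
The comparison made in Example 1 can be replayed mechanically. The sketch below is only an illustration under simplifying assumptions: relations are plain Python lists, the views are modeled as nested tuples rather than XML, and the function names (q1, q2, u1_view, u2_view, delete_cascade) are hypothetical. It applies the same delete-cascade translation for both views and checks whether the regenerated view equals the view the user expects.

```python
BOOK = [("98001", "TCP/IP Illustrated"),
        ("98002", "Programming in Unix"),
        ("98003", "Data on the Web")]
PRICE = [("98001", 63.70, "www.amazon.com"),
         ("98003", 56.00, "www.amazon.com"),
         ("98003", 45.60, "www.bookpool.com")]
TITLE = "TCP/IP Illustrated"

def q1(book, price):
    """V1: each book_info nests the price_info elements of that book."""
    return [(bid, title, [(amt, site) for pb, amt, site in price if pb == bid])
            for bid, title in book]

def q2(book, price):
    """V2: each price_info nests the book_info of the joined book."""
    return [(amt, site, (bid, title))
            for bid, title in book
            for pb, amt, site in price if pb == bid]

def u1_view(v1):
    """User intent on V1: drop the whole book_info element with the given title."""
    return [b for b in v1 if b[1] != TITLE]

def u2_view(v2):
    """User intent on V2: drop only the nested book_info, keep the price_info."""
    return [(amt, site, None if info[1] == TITLE else info)
            for amt, site, info in v2]

def delete_cascade(book, price):
    """Relational translation: delete the book and, by cascade, its prices."""
    gone = {bid for bid, title in book if title == TITLE}
    return ([b for b in book if b[0] not in gone],
            [p for p in price if p[0] not in gone])

new_book, new_price = delete_cascade(BOOK, PRICE)
v1_ok = q1(new_book, new_price) == u1_view(q1(BOOK, PRICE))
v2_ok = q2(new_book, new_price) == u2_view(q2(BOOK, PRICE))
print(v1_ok, v2_ok)
```

In this model the check succeeds for V1 and fails for V2: the cascade removes the price tuple, so the price_info element that the user wanted to keep in V2 is missing from the regenerated view.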
<bib>
  <price_info>
    <amount>63.70</amount>
    <website>www.amazon.com</website>
  </price_info>
  <price_info>
    <amount>56.00</amount>
    <website>www.amazon.com</website>
    <book_info>
      <bookid>98003</bookid>
      <title>Data on the Web</title>
    </book_info>
  </price_info>
</bib>

(a) V2' 



(b) \(U^R\). Same as Fig. 5(b). 

(c) D'. Same as Fig. 5(c). 



<bib>
  <price_info>
    <amount>56.00</amount>
    <website>www.amazon.com</website>
    <book_info>
      <bookid>98003</bookid>
      <title>Data on the Web</title>
    </book_info>
  </price_info>
</bib>

(d) Q2(D') 



Fig. 6. Translate \(u_2^V\). (a) V2': The user expected updated view, (b) \(U^R\): The translated 
update, (c) D': The updated relational database, (d) Q2(D'): The regenerated view 



Example 2 (Content duplication). Next we compare the two virtual XQuery 
views V1 and V3 in Fig. 3. The book (bookid=98003) with two prices is 
exposed twice in V3, while only once in V1. The update \(u_3^V\) in Fig. 4 will delete 
the “book_info” element from amazon, while keeping the one from bookpool. 
Now should we delete the book tuple underneath? It is unclear. An additional 
condition, such as an extra translation rule like “No underlying tuple is deleted 
if it is still referenced by any other part of the view”, could make the update \(u_3^V\) 
translatable by keeping the book tuple untouched. This update is thus called 
conditionally translatable. 

This ambiguous content duplication is introduced by the XQuery “FOR” 
expression. This property could also arise in relational Join views. 

Example 3 (Structural duplication). Consider Q4 in Fig. 3, where each “bookid” is 
exposed twice in a single “book_info” element. The update on V4 in Fig. 4 that 
deletes the first price of the specified book is classified as conditionally 
translatable. Since the primary key “bookid” is touched by this update, we cannot decide 
whether to delete the book tuple underneath. With an additional condition, such 
as knowledge of the user intention about the update, it becomes translatable. 

Structural duplication, as illustrated above, is special to XML view updating. 
While it also exists in the relational context, it does not cause any ambiguity there. 
The flat relational data model only allows tuple-based view insertion/deletion, so 
an update touches all, not just some, of the duplicates within a view tuple. 
Instead of always enforcing an update on the biggest view element “book_info”, the 
flexible hierarchical structure of XML allows a “partial” update on subelements 
inside it. Inconsistency between the duplicated parts thus occurs. 

Example 4 (Update granularity). Compared with the failure of translating \(u_2^V\) 
in Example 1, the update in Fig. 4 that deletes a “price_info” element from the 
same view V2 is conditionally translatable. It deletes the whole “price_info” 
element instead of just the sub-element “book_info”. The translated relational 
update sequence \(U^R\) is the same as in Fig. 6(b). The regenerated view is the same as 
what the user expects. Due to content duplication, the update is said to be 
conditionally translatable. 

The XML hierarchical structure offers an opportunity for different update 
granularity, an issue that does not arise for relational views. 



3 Formalizing the Problem of XML View Updatability 

The structure of a relation is described by a relation schema \(\mathcal{R}(\mathcal{N}, \mathcal{A}, \mathcal{F})\), 
where \(\mathcal{N}\) is the name of the relation, \(\mathcal{A} = \{a_1, a_2, \ldots, a_m\}\) is its attribute set, 
and \(\mathcal{F}\) is a set of constraints. A relation \(R\) is a finite subset of \(dom(\mathcal{A})\), a 
product of all the attribute domains. A relational database, denoted as \(D\), is a 
set of \(n\) relations \(R_1, \ldots, R_n\). A relational update operation \(u^R \in \mathcal{U}^R\) is a 
deletion, insertion or replacement on a relation \(R\). A sequence of relational 
update operations, denoted by \(U^R = \{u_1^R, u_2^R, \ldots, u_p^R\}\), is also modeled as a function 
\(U^R(D) = u_p^R(u_{p-1}^R(\ldots u_1^R(D)))\). 



Table 1. Notations for XML view update problem 

  Notation                                               Meaning
  \(D\)                                                  relational database
  \(\mathcal{R}(\mathcal{N}, \mathcal{A}, \mathcal{F})\)  schema of relation
  \(R\)                                                  relation
  \(\mathcal{U}^R\)                                      domain of relational update operations
  \(u^R\)                                                relational update operation
  \(U^R\)                                                sequence of relational update operations
  \(V\)                                                  XML view
  \(DEF^V\)                                              XML view definition
  \(u^V\)                                                view update
  \(\mathcal{U}^V\)                                      domain of view update operations



An XML view \(V\) over a relational database \(D\) is defined by a view 
definition \(DEF^V\) (an XQuery expression in our case). The domain of the view is 
denoted by \(dom(V)\). Let \(rel\) be a function to extract the relations in \(D\) 
referenced by \(DEF^V\); then \(rel(DEF^V) = \{R_{i_1}, R_{i_2}, \ldots, R_{i_p}\} \subseteq D\). An XML view 
schema is extracted from both \(DEF^V\) and \(rel(DEF^V)\). See [13] for details. 

Let \(u^V \in \mathcal{U}^V\) be an update on the view \(V\). A valid view update (e.g., Fig. 4) 
is an insertion or deletion that satisfies all constraints in the view schema. 

Definition 1. Given an update translation policy, let \(D\) be a relational database 
and \(V\) be a virtual view defined by \(DEF^V\). A relational update sequence \(U^R\) is 
a correct translation of \(u^V\) iff (a) \(u^V(DEF^V(D)) = DEF^V(U^R(D))\) and (b) 
\(u^V(DEF^V(D)) = DEF^V(D) \Rightarrow U^R(D) = D\). 

First, a correct translation means the “rectangle” rule holds (Fig. 7). Intu- 
itively, it implies the translated relational updates do not cause any view side 
effects. Second, if an update operation does not affect the view, then it should 
not affect the relational base either. This guarantees any modification of the 
relational base is indeed done for the sake of the view. 

Fig. 8 shows a typical partition of the view update domain \(\mathcal{U}^V\). The XML 
view updatability classifies a valid view update as either unconditionally 
translatable, conditionally translatable or un-translatable. 
[Figure 7: commuting diagram. (1) \(DEF^V\) maps \(D\) to \(V\); (2) \(u^V\) maps \(V\) to \(u^V(V)\); 
(3) \(U^R\) maps \(D\) to \(U^R(D)\); (4) \(DEF^V\) maps \(U^R(D)\) to \(u^V(V)\).] 

[Figure 8: nested regions of the view update domain: invalid updates and valid 
updates, the latter divided into untranslatable, conditionally translatable and 
unconditionally translatable updates.] 

Fig. 7. Correct translation of view update to relational update 

Fig. 8. The partition of view update domain \(\mathcal{U}^V\) 



4 Theoretical Foundation for XML View Updatability 

Dayal and Bernstein [7] show that a correct translation exists in the case of a 
“clean source”, when only considering functional dependencies inside a single 
relation. In the context of XML views, we now adopt and extend this work to 
also consider functional dependencies between relations. 

Definition 2. Given a relational database \(D\) and an XML view \(V\) defined over 
several relations \(rel(DEF^V) \subseteq D\). Let \(v\) be a view element of \(V\). Let \(g = 
(t_1, \ldots, t_p)\) be a generator of \(v\), where \(t_i \in R_x\) for \(R_x \in rel(DEF^V)\). Then \(t_i\) 
is called a source tuple in \(D\) of \(v\). 

Further, \(t_j \in R_y\) is an extended source tuple in \(D\) of \(v\) iff \(\exists\, t_i \in g\) such that 
\(t_i.a_k\) is a foreign key of \(t_j.a_z\), where \(a_k \in \mathcal{R}_x(\mathcal{A})\), \(a_z \in \mathcal{R}_y(\mathcal{A})\) and \(R_x, R_y \in 
rel(DEF^V)\). \(g_e = g \cup \{t_j \mid t_j\) is an extended source tuple of \(v\}\) is called an 
extended generator of \(v\). 

A source tuple is a relational row used to compute the view element. For 
instance, in V1 of Fig. 3, the first view element \(v_1\) is the book_info element with 
bookid=98001. Let \(R_1\) and \(R_2\) denote the book and price relations respectively; 
then the generator \(g\) of \(v_1\) is \(g = (t_1, t_2)\), where \(t_1 \in R_1\) is the book 
tuple (98001, TCP/IP Illustrated) and \(t_2 \in R_2\) is the price tuple (98001, 63.70, 
www.amazon.com). Let the view-element \(v_2\) be the title of \(v_1\). Then the source 
tuple of \(v_2\) is \(t_1\). Since \(t_1.bookid\) is a foreign key of \(t_2.bookid\), we say \(t_2\) is an 
extended source tuple of \(v_2\), and \(g_e = (t_1, t_2)\) is an extended generator of \(v_2\). 

Definition 3. Let \(V^0\) be a part of a given XML view \(V\). Let \(G(V^0)\) be the set 
of generators of \(V^0\) defined by \(G(V^0) = \{g \mid g\) is a generator of a view-element 
in \(V^0\}\). For each \(g = (t_1, \ldots, t_p) \in G(V^0)\), let \(H(g)\) be some nonempty subset of 
\(\{t_i \mid t_i \in g\}\). Then any superset of \(\bigcup_{g \in G(V^0)} H(g)\) is a source in \(D\) of \(V^0\). (If 
\(G(V^0) = \emptyset\), then \(V^0\) has no source in \(D\).) 

Similarly, let \(G_e(V^0)\) be the set of extended generators for view elements in 
\(V^0\). Then any superset of \(\bigcup_{g \in G_e(V^0)} H(g)\) is an extended source in \(D\) of \(V^0\), 
denoted by \(S_e\). 

A source includes the underlying relational part of a view “portion” \(V^0\) which 
consists of multiple view-elements. For example, let \(V^0\) = V1 (Fig. 3); then \(G(V^0) = 
\{g_1, g_2\}\), where \(g_1\) = {(98001, TCP/IP Illustrated), (98001, 63.70, www.amazon.com)} 
and \(g_2\) = {(98003, Data on the Web), (98003, 56.00, www.amazon.com), (98003, 
45.60, www.bookpool.com)}. That is, \(G(V^0)\) includes all the generators for view 
elements in \(V^0\). Let \(H(g_1)\) = {(98001, TCP/IP Illustrated)} and \(H(g_2)\) = 
{(98003, 56.00, www.amazon.com)}. Then {(98001, TCP/IP Illustrated), (98003, 
56.00, www.amazon.com)} is a source of \(V^0\), and also an extended source of \(V^0\). 

Definition 4. Let \(D = \{R_1, \ldots, R_n\}\) be a relational database. Let \(V^0\) be part of 
a given XML view \(V\) and \(S_e\) be an extended source in \(D\) of \(V^0\). \(S_e\) is a clean 
extended source in \(D\) of \(V^0\) iff \((\forall v \in V - V^0)(\exists S'_e)\) such that \(S'_e\) is an 
extended source in \((R_1 - S_{e1}, \ldots, R_n - S_{en})\) of \(v\). Or, equivalently, \(S_e\) is a clean 
extended source in \(D\) of \(V^0\) iff \((\forall v \in V - V^0)\)(\(S_e\) is not an extended source in 
\(D\) of \(v\)). 

A clean extended source defines a source that is only referenced by the given 
view element itself. For instance, given the view-element \(v\) in V2 (Fig. 3) 
representing the book_info element (bookid = 98001), its extended source {(98001, 
TCP/IP Illustrated), (98001, 63.70, www.amazon.com)} is not a clean extended 
source since it is also an extended source of the price element. 
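
The second, equivalent form of Definition 4 lends itself to a direct check. The following sketch rests on stated assumptions: the extended generators of the view elements are written down by hand rather than derived from the view query, and, for a single view element, a tuple set is taken to be an extended source exactly when it shares at least one tuple with that element's extended generator (our reading of Definition 3). The function names are hypothetical.

```python
# Tuples of Fig. 1, tagged with their relation name.
T_BOOK1 = ("book", "98001", "TCP/IP Illustrated")
T_PRICE1 = ("price", "98001", 63.70, "www.amazon.com")
T_BOOK3 = ("book", "98003", "Data on the Web")
T_PRICE3A = ("price", "98003", 56.00, "www.amazon.com")
T_PRICE3B = ("price", "98003", 45.60, "www.bookpool.com")

def is_extended_source(candidate, ext_generator):
    """Assumed reading: candidate contains a nonempty subset of the generator."""
    return bool(set(candidate) & set(ext_generator))

def is_clean(candidate, generators_outside):
    """Clean iff no view element outside V0 has candidate as an extended source."""
    return not any(is_extended_source(candidate, g) for g in generators_outside)

# Candidate: extended source of the book_info element with bookid 98001.
candidate = {T_BOOK1, T_PRICE1}

# V2: the enclosing price_info element is outside V0 and has the same generator.
outside_v2 = [(T_BOOK1, T_PRICE1), (T_BOOK3, T_PRICE3A), (T_BOOK3, T_PRICE3B)]
# V1: the only other book_info element shown in Fig. 3 is the one for 98003.
outside_v1 = [(T_BOOK3, T_PRICE3A, T_PRICE3B)]

print(is_clean(candidate, outside_v2))
print(is_clean(candidate, outside_v1))
```

With these hand-built generators the candidate is not clean in V2, because the enclosing price_info element is computed from the same tuples, whereas it is clean in V1.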

The clean extended source theory below captures the connection between 
clean extended source and update translatability (proofs in [13]). It serves as a 
conservative solution for identifying the (unconditionally) translatable updates. 

Theorem 1. Let \(u^V\) be the deletion of a set of view elements \(V^d \subseteq V\). Let \(\tau\) be 
a translation procedure, \(\tau(u^V, D) = U^R\). Then \(\tau\) correctly translates \(u^V\) to 
\(D\) iff \(U^R\) deletes a clean extended source of \(V^d\). 

By Definition 1, a correct delete translation is one without any view side 
effect. This is exactly what deleting a clean extended-source guarantees by Def- 
inition 4. Thus Theorem 1 follows. 

Theorem 2. Let $u^V$ be the insertion of a set of view elements $V^i$ into $V$. Let
$V^- = V - V^i$, $V^u = V^i - V$. Let $\tau$ be a translation procedure, $\tau(u^V, D) = U^R$.
Then $\tau$ correctly translates $u^V$ to $D$ iff (i) $(\forall v \in V^u)$($U^R$ inserts a source
tuple of $v$) and (ii) $(\forall v \in dom(V) - (V^u \cup V^-))$($U^R$ does not insert a source
tuple of $v$).
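
The sets named in Theorem 2 can be made concrete on a toy example. The sketch below is only an illustration: the element names and the contents of dom(V), V and V^i are hypothetical, and view elements are reduced to plain strings. It computes V^- and V^u and checks that the elements excluded by condition (ii) split into duplicates and extras.

```python
# Toy universe: view elements are plain strings (hypothetical data).
dom_v = {"a", "b", "c", "d", "e"}   # dom(V): every element the view could hold
v = {"a", "b", "c"}                  # V: elements currently in the view
v_i = {"c", "d"}                     # V^i: elements the update asks to insert

v_minus = v - v_i                    # V^- = V - V^i
v_u = v_i - v                        # V^u = V^i - V, the genuinely new elements

# Condition (ii): no source tuple may be inserted for these elements.
forbidden = dom_v - (v_u | v_minus)

duplicates = v_i & v                 # requested but already in the view
extras = dom_v - (v_i | v)           # neither requested nor already present

# The forbidden set is exactly the duplicates plus the extras.
print(sorted(v_u), sorted(forbidden), forbidden == duplicates | extras)
```

Here only "d" needs a source tuple; inserting a source for "c" would be a duplicate insertion and inserting one for "e" would be an extra insertion.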

Since $dom(V) - (V^u \cup V^-) = (dom(V) - (V^i \cup V)) \cup (V^i \cap V)$, Theorem 2
indicates that a correct insert translation is one without any duplicate insertion
(inserting a source of $V^i \cap V$) or any extra insertion (inserting a source of
$dom(V) - (V^i \cup V)$). That is, it inserts a clean extended source for the new
view-element. Duplicate insertion is not allowed by BCNF, while extra insertion
will cause a view side effect. For example, for an insert update $u^V$ in Fig. 4, let
$u_1^R$ = {Insert (98003, Data on the Web) into book} and $u_2^R$ = {Insert (98003, 56.00,
www.ebay.com) into price}. Then $U^R = \{u_1^R, u_2^R\}$ is not a correct translation
since it inserts a duplicate source tuple into book, while $U^R = \{u_2^R\}$ is a
correct translation.

5 Graph-Based Algorithm for Deciding View Updatability

We now propose a graph-based algorithm to identify the factors and their
effects on update translatability based on our clean extended source theory.
We assume that the relational database is in BCNF and that no cyclic
dependency caused by integrity constraints among relations exists. Also, the
predicate used in the view query expression is a conjunction of non-correlation
predicates (e.g., $price/website = "www.amazon.com") or equi-correlation
predicates (e.g., $book/bookid = $price/bookid).



5.1 Graphic Representation of XML Views 

Two graphs capture the update-related features in the view $V$ and the relational
base $D$. The view relationship graph $G_R(N_{G_R}, E_{G_R})$ is a forest representing
the hierarchical and cardinality constraints in the XML view schema. An internal
node, represented by a triangle (△), identifies a view element or attribute labeled
by its name. A leaf node (represented by a small circle ○) is an atomic type,
labeled by both the XPath binding and the name of its corresponding relational
column $R_x.a_k$. An edge $e(n_1, n_2) \in E_{G_R}$ represents that $n_1$ is a parent of $n_2$
in the view hierarchy. Each edge is labeled by the cardinality relationship and
condition (if any) between its end nodes. A label "?" means each parent node can
have only one child, while "*" shows that multiple children are possible. Figures
9(a) to 9(d) depict the view relationship graphs for V1 to V4 in Fig. 3 respectively.

Definition 5. The hierarchy implied in the relational model is defined as:

(1) Given a relation schema $\mathcal{R}(N, A, F)$, with $A = \{a_i \mid 1 \le i \le m\}$, $N$ is
called the parent of the attribute $a_i$ ($1 \le i \le m$).

(2) Given two relation schemas $\mathcal{R}_i(N_i, A_i, F_i)$ and $\mathcal{R}_j(N_j, A_j, F_j)$, with a foreign
key constraint defined as $PK(\mathcal{R}_i) \leftarrow FK(\mathcal{R}_j)$, $N_i$ is the parent of $N_j$.

The view trace graph $G_T(N_{G_T}, E_{G_T})$ represents the hierarchical and
cardinality constraints in the relational schema underlying the XML view. The set of
leaf nodes of $G_T$ corresponds to the union of all leaves of $G_R$. In particular, a leaf
node labeled by the primary key attribute of a relation is called a key node
(depicted by a black circle •). An internal node, depicted by a triangle (△), is labeled
by the relation name. Each edge $e(n_1, n_2) \in E_{G_T}$ means $n_1$ is the parent of $n_2$ by
Definition 5. An edge is labeled by its foreign key condition (if it is generated by
rule (2) in Definition 5), and the cardinality relationship between its end nodes.
The view trace graphs of V1 to V4 are identical (Fig. 10), since they are all defined
over the same attributes of base relations.

The concept of closure in $G_R$ and $G_T$ is used to represent the "effect" of an
update on the view and on the relational database respectively. Intuitively, their
relationship indicates the updatability of the given view.

The closure of a node $n \in N_{G_R}$, denoted by $n^+_{G_R}$, is defined as follows: (1) If $n$
is a leaf node, $n^+_{G_R} = \{n\}$. (2) Otherwise, $n^+_{G_R}$ is the union of its children's
closures grouped by their hierarchical relationship and marked by their cardinality
(for simplicity, not shown when the cardinality mark is ?). For example, in Figure
9(a), $(n_3)^+_{G_R} = \{n_3\}$, while $(n_5)^+_{G_R} = \{n_6, n_7\}$ and
$(n_2)^+_{G_R} = \{n_3, n_4, (n_6, n_7)^*\}$.

[Figure: view relationship graphs (a) to (d). Leaf nodes are labeled
book/row/bookid (book.bookid), book/row/title (book.title),
price/row/amount (price.amount) and price/row/website (price.website).]
*Note: con = (book/row/bookid=price/row/bookid)

Fig. 9. $G_R$ of V1 to V4 as shown by (a) to (d)

[Figure: view trace graph.]
*Note: con = (book.bookid=price.bookid)

Fig. 10. $G_T$ of V1 - V4

The closure of a node $n \in N_{G_T}$ is defined in the same manner as in $G_R$,
except for leaf nodes. Each leaf node has the same closure as its parent node.
For instance, in Fig. 10, $(n_2)^+_{G_T} = (n_3)^+_{G_T} = (n_1)^+_{G_T} = \{n_2, n_3, (n_5, n_6)^*\}$. This
closure definition in $G_T$ is based on the pre-selected update policy in Section
2.1. If a different policy were used, the definition would need to be adjusted
accordingly. For example, if we pick the mixed type, the closure will be "only
the key node has the same closure definition as its parent node, while any other
leaf node has itself as the closure". Consequently in Fig. 10, $(n_3)^+_{G_T} = \{n_3\}$,
while $(n_2)^+_{G_T} = \{n_2, n_3, (n_5, n_6)^*\}$. The delete on these non-key leaf nodes can
be translated as a replacement on the corresponding relational column.

To reduce the closure definition, the group mark "()" can be eliminated if its
cardinality mark is "?". For example, in Figure 9(c), $(n_2)^+_{G_R} = \{n_3, n_4, (n_6, n_7)\} =
\{n_3, n_4, n_6, n_7\}$. The closure of a set of nodes $N$, denoted by $N^+$, is defined as
$N^+ = \biguplus_{n_i \in N} n_i^+$, where $\uplus$ is a "union-like" operation that combines not only
the nodes but also their shared occurrence. For instance, in Fig. 10, $\{n_2, n_5\}^+_{G_T} =
(n_2)^+_{G_T} \uplus (n_5)^+_{G_T} = \{n_2, n_3, (n_5, n_6)^*\} \uplus \{n_5, n_6\} = \{n_2, n_3, (n_5, n_6)^*\}$. Two leaf
nodes in $G_R$ or $G_T$ are equal if and only if the relational attribute labels in their
respective node labels are the same.
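
The closure rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation used in [13]: the `Node` class, the tuple encoding of a starred group, and the tree shapes assumed for Fig. 9(a) and Fig. 10 are our own assumptions, chosen so that the results agree with the closures quoted in the text. Edge conditions are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    children: tuple = ()  # (cardinality, Node) pairs; cardinality is "?" or "*"


def closure_r(node):
    """Closure in G_R: a leaf is its own closure; an internal node unions its
    children's closures, keeping a child reached over a "*" edge as one
    starred group and flattening "?" groups."""
    if not node.children:
        return frozenset({node.name})
    items = set()
    for card, child in node.children:
        sub = closure_r(child)
        if card == "*":
            items.add(("*", sub))   # keep the group and its cardinality mark
        else:
            items |= sub            # the "?" group mark is dropped
    return frozenset(items)


def closure_t(node, parent=None):
    """Closure in G_T: same rule, except a leaf borrows its parent's closure."""
    if not node.children and parent is not None:
        return closure_r(parent)
    return closure_r(node)


# Assumed shape of Fig. 9(a): n2 has leaves n3, n4 and a "*" child n5 (n6, n7).
r5 = Node("n5", (("?", Node("n6")), ("?", Node("n7"))))
r2 = Node("n2", (("?", Node("n3")), ("?", Node("n4")), ("*", r5)))

# Assumed shape of Fig. 10: n1 has leaves n2, n3 and a "*" child n4 (n5, n6).
t2 = Node("n2")
t4 = Node("n4", (("?", Node("n5")), ("?", Node("n6"))))
t1 = Node("n1", (("?", t2), ("?", Node("n3")), ("*", t4)))

print(closure_r(r2))
print(closure_t(t2, t1) == closure_r(t1))
```

Under the mixed-type policy mentioned above, only `closure_t` would change: non-key leaves would return themselves instead of the parent's closure.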

5.2 A Graph-Based Algorithm for View Updatability Identification 

Definition 6. Two closures $C_1$ and $C_2$ match, denoted by $C_1 \cong C_2$, iff the
node sets of $C_1$ and $C_2$ are equal. Further, $C_1$ and $C_2$ are equal, denoted by
$C_1 \equiv C_2$, iff the node groups, cardinality marks of each group, and conditions
on each "*" edge are all the same.

For two closures to match means that the same view schema nodes are
included, while equality indicates that the same instances of XML view
elements will be included. For example, $(n_2)^+_{G_R}$ in Fig. 9(c) and $(n_1)^+_{G_T}$ in Fig.
10 match. That is, both closures include the same XML view schema nodes:
book.bookid, book.title, price.amount, price.website. However, $(n_2)^+_{G_R}$ in Figure
9(a) and $(n_1)^+_{G_T}$ in Figure 10 are equal, namely {book.bookid, book.title,
(price.amount, price.website)*}. This is because their group partition (marked by
"()"), cardinality mark (* or ?) and conditions for each edge are all the same. Both
closures touch exactly the same XML view-element instances.

Theorem 3. Let $V$ be a view defined by $DEF_V$ over a relational database $D$
with the view relationship graph $G_R(N_{G_R}, E_{G_R})$ and view trace graph $G_T(N_{G_T},
E_{G_T})$. Let $Y \subseteq N_{G_R}$ and $X \subseteq N_{G_T}$. ($\forall$ generators $g, g'$ of view elements $v$ and
$v'$ respectively, $g[X] = g'[X] \Rightarrow v[Y] = v'[Y]$) iff their closures are equal, $X^+_{G_T} \equiv Y^+_{G_R}$.

Theorem 3 indicates that two equal generators always produce identical
view elements iff the respective closures of the view schema nodes in $G_R$ and
$G_T$ are equal. Theorem 3 now enables us to produce an algorithm for detecting
the clean extended sources $S_e$ of a view element based on schema knowledge
captured in $G_R$ and $G_T$.

Theorem 4. Let $V$, $DEF_V$, $D$, $G_R$, $G_T$, $Y$ be defined as in Theorem 3. Given a
view element $v \in V(Y)$, there is a clean extended source $S_e$ of $v$ in $D$ iff $(\exists X \subseteq
N_{G_T})$ such that the closures are equal, $X^+_{G_T} \equiv Y^+_{G_R}$.

Theorem 4 indicates that a given view element $v$ has a clean extended source
iff the closure of its schema node in $G_R$ has an equal closure in $G_T$. As indicated
by Theorems 1 and 2, the existence of a clean extended source for a given XML
view element implies that the update touching this element is unconditionally
translatable. The following observation thus serves as a general methodology for
view updatability determination.

Observation 1 Let $D$, $V$, $G_R$, $G_T$, $Y$ be defined as in Theorem 3. (1) Updates that
touch $Y$ are unconditionally translatable iff $(\exists X \subseteq N_{G_T})$ such that the closures
are equal, $X^+_{G_T} \equiv Y^+_{G_R}$. (2) Updates that touch $Y$ are conditionally translatable iff
$(\exists X \subseteq N_{G_T})$ such that the closures match, $X^+_{G_T} \cong Y^+_{G_R}$. (3) Otherwise, updates
on $Y$ are un-translatable.

However, searching all node closures in $G_T$ to find one equal to the closure of
a given view-element is expensive. According to the generation rules of $G_T$, the
nodes in the closure of $v$ also serve as leaf nodes in $G_T$. We thus propose to start
searching from leaf nodes within the closure, thus reducing the search space.
Observation 2 uses the following definition to determine the translatability
of a given view update.

Definition 7. Let $n$ be a node in $G_R$, with its closure in $G_R$ denoted by
$C_R = n^+_{G_R}$. Let $C_T = \biguplus_{n_i \in C_R} (n_i)^+_{G_T}$, where $(n_i)^+_{G_T}$ is the closure of $n_i$ in $G_T$.
We say $n$ is a clean node iff $C_R$ and $C_T$ are equal ($C_R \equiv C_T$), a consistent node
iff $C_R$ and $C_T$ match ($C_R \cong C_T$), and an inconsistent node otherwise.

For a node to be inconsistent means that the effect of an update on the
view (node closure in $G_R$) is different from the effect on the relational side (node
closure in $G_T$) based on the selected policy (closure definition in $G_T$). It is thus
un-translatable. A clean node is guaranteed to be safely updatable without any
view side-effects. A dirty consistent node, however, needs an additional condition
to be updatable. For example, $n_5$ in Fig. 9(a) is a clean node. In Fig. 9(b), $n_5$ is
an inconsistent node and $n_2$ is a dirty consistent node.

Observation 2 An update on a clean node is unconditionally translatable, on 
a consistent node it is conditionally translatable, while on an inconsistent node 
it is un-translatable. 



Algorithm 1 Optimized Update Translatability Checking Algorithm

/* Given G_R and G_T of a view V, determine
   the translatability of a view update u */
Procedure checkTranslatability(u, G_R, G_T)
  Node n = identifyNodeToUpdate(u, G_R)
  classifyNode(n, G_R, G_T)
  if n is a clean node then
    u is unconditionally translatable
  else
    if n is a consistent node then
      u is conditionally translatable
    else
      u is untranslatable
    end if
  end if

/* Classify the node n in G_R to be updated */
Procedure classifyNode(n, G_R, G_T)
  Initiate C_R and C_T empty
  C_R = computeClosure(n, G_R)
  while C_R has more nodes do
    get the next node n_i in C_R
    C_T = C_T ⊎ computeClosure(n_i, G_T)
  end while
  if C_R matches C_T then
    if C_R equals C_T then
      n is a clean node
    else
      n is a consistent node
    end if
  else
    n is an inconsistent node
  end if
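
The classification step of Algorithm 1 can be illustrated with a small Python sketch. It is a sketch under stated assumptions rather than the procedure of [13]: closures are encoded as frozensets whose items are either attribute labels or `("*", group)` tuples, edge conditions are ignored, and `combine` implements one possible reading of the "union-like" operation (a plain node is dropped when it already occurs inside a starred group). The `TRACE_CLOSURE` table is a hypothetical encoding of the leaf closures of Fig. 10.

```python
def flatten(closure):
    """Node set of a closure, ignoring grouping and cardinality marks."""
    out = set()
    for item in closure:
        if isinstance(item, tuple):      # ("*", group)
            out |= flatten(item[1])
        else:
            out.add(item)
    return out


def combine(closures):
    """Union-like merge: union everything, then drop plain nodes that already
    occur inside a starred group (an assumed reading of 'shared occurrence')."""
    merged = set().union(*list(closures))
    grouped = set()
    for item in merged:
        if isinstance(item, tuple):
            grouped |= flatten(item[1])
    return frozenset(i for i in merged if isinstance(i, tuple) or i not in grouped)


def classify_node(c_r, trace_closure):
    """Return 'clean', 'consistent' or 'inconsistent' following Definition 7."""
    c_t = combine([trace_closure[leaf] for leaf in sorted(flatten(c_r))])
    if flatten(c_r) != flatten(c_t):     # closures do not even match
        return "inconsistent"
    return "clean" if c_r == c_t else "consistent"


# Hypothetical encoding of Fig. 10: every leaf borrows its parent's closure.
PRICE = frozenset({"price.amount", "price.website"})
BOOK = frozenset({"book.bookid", "book.title", ("*", PRICE)})
TRACE_CLOSURE = {
    "book.bookid": BOOK, "book.title": BOOK,
    "price.amount": PRICE, "price.website": PRICE,
}

print(classify_node(BOOK, TRACE_CLOSURE))                          # grouped, as in Fig. 9(a)
print(classify_node(frozenset(flatten(BOOK)), TRACE_CLOSURE))      # flat, as in Fig. 9(c)
```

A closure that omits book.bookid while keeping book.title is classified as inconsistent by this sketch, because the trace-side closure pulls in the key attribute.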



Algorithm 1 shows our optimized update translatability checking algorithm
using Observation 2. It first identifies the $G_R$ node being deleted or inserted.
Then, using Definition 7, the procedure classifyNode(n, G_R, G_T) determines the
type of the node to be updated. Thereafter the given view update can be classified
as un-translatable, conditionally or unconditionally translatable by Observation 2.
Using this optimized update translatability checking algorithm, a concrete case
study on the translatability of deletes and inserts is also provided in [13].

6 Conclusion

In this paper, we have identified the factors determining view updatability in 
general and also in the context of XQuery views in particular. The extended 
clean-source theory for determining translation correctness is presented. A graph- 
based algorithm has also been presented to identify the conditions under which 
a correct translation of a given view update exists. 

Our solution is general. It could be used by an update translation system
such as [4] to identify the translatable updates before their translation is
attempted. This way we would guarantee that only a "well-behaved" view update
is passed down to the next translation step. [4] assumes the view is always
well-formed, that is, joins are through keys and foreign keys, and nesting is
controlled to agree with the integrity constraints and to avoid duplication. The
update over such a view is thus always translatable. Our work is orthogonal to
this work by addressing new challenges related to the decision of translation
existence when conflicts are possible, that is, when a view cannot always be
guaranteed to be well-formed (as assumed in this prior work).

Our view updatability checking solution is based on schema reasoning and thus
utilizes only knowledge of the view, the database schema and the constraints. Note
that the translated updates might still conflict with the actual base data. For
example, an update inserting a book (bookid = 98002) into V1 is said to be
unconditionally translatable by our schema check procedure, yet conflicts with the
base data in Fig. 1 may still arise. Depending on the selected update translation
policy, the translated update can then be either rejected or executed by replacing
the existing tuple with the newly inserted tuple. This run-time updatability issue
can only be resolved at execution time by examining the actual data in the database.

Abstract. Past research work on modeling and managing temporal
information has, so far, failed to elicit support in commercial database
systems. The increasing popularity of XML offers a unique opportunity to
change this situation, inasmuch as XML and XQuery support temporal 
information much better than relational tables and SQL. This is the im- 
portant conclusion claimed in this paper where we show that valid-time, 
transaction-time, and bitemporal databases can be naturally viewed in 
XML using temporally-grouped data models. Then, we show that com- 
plex historical queries, that would be very difficult to express in SQL 
on relational tables, can now be easily expressed in standard XQuery 
on such XML-based representations. We first discuss the management of 
transaction-time and valid-time histories and then extend our approach 
to bitemporal histories. The approach can be generalized naturally to 
support the temporal management of arbitrary XML documents and 
queries on their version history. 



1 Introduction 

While users’ demand for temporal database applications is only increasing with 
time [1], database vendors are not moving forward in supporting the management 
and querying of temporal information. Given the remarkable research efforts 
that have been spent on these problems [2], the lack of viable solutions must be 
attributed, at least in part, to the technical difficulties of introducing temporal 
extensions into the relational data model and query languages. 

In the meantime, database researchers, vendors and SQL standardization 
groups are working feverishly to extend SQL with XML publishing capabili- 
ties [4] and to support languages such as XQuery [5] on the XML-published 
views of the relational database [6]. In this context, XML and XQuery can re- 
spectively be viewed as a new powerful data model and query language, thus 
inviting the natural question on whether they can provide a better basis for rep- 
resenting and querying temporal database information. In this paper, we answer 
this critical question by showing that transaction-time, valid-time and bitempo- 
ral database histories can be effectively represented in XML and queried using 
XQuery without requiring extensions of current standards. This breakthrough 
over the relational data model and query languages is made possible by (i) the 
ability of XML to support a temporally grouped model, which has long been
recognized as natural and expressive [7, 8] but could not be implemented well in
the flat structure of the relational data model [9], and (ii) the greater expressive
power and native extensibility of XQuery (which is Turing-complete [10]) over SQL.
Furthermore, these benefits are not restricted to XML-published databases;
indeed these temporal representations and queries can be naturally extended to
arbitrary XML documents, and used, e.g., to support temporal extensions for
database systems featuring native support for XML and XQuery, and in
preserving the version history of XML documents, in archives [11] and web
warehouses [12].

In this paper, we build on and extend techniques described in previous papers.
In particular, support for transaction time was discussed in [13], and techniques
for managing document versions were discussed in [12]. However, the focus of
this paper is supporting valid-time and bitemporal databases, which pose new
complexity and were not discussed in previous papers.

The paper is organized as follows. After a discussion of related work in the
next section, we study an example of temporal relations modeled with a temporal
ER model. In Section 4 we show that the valid-time history of a relational
database can be represented in XML and queried with XQuery. Section 5 briefly
reviews how to model transaction-time history with XML. In Section 6, we focus
on an XML-based bitemporal data model to represent the bitemporal history of a
relational database, and show that complex bitemporal queries can be expressed
with XQuery based on this model and that database updates can also be supported.
Section 7 concludes the paper.

2 Related Work 

Temporal ER Modeling. There has been much interesting work on ER-based
temporal modeling of information systems at the conceptual level. For instance,
ER models have been supported in commercial products for database schema
designs, and more than 10 temporally enhanced ER models have been proposed
in the research community [14]. As discussed in the survey by Gregersen and
Jensen [14], there are two major approaches to extending the ER model for
temporal support: devising new notational shorthands, or altering the semantics of
the current ER model constructs. The recent TIMEER model [15] is based on
an ontological foundation and supports an array of properties. Among the
temporal ER models, the Temporal EER Model (TEER) [16] extends the temporal
semantics into the existing EER modeling constructs.

Temporal Databases. A body of previous work on temporal data models
and query languages includes [17-20]; thus the design space for the relational
data model has been exhaustively explored [2,21]. Clifford et al. [9] classified
them into two main categories: temporally ungrouped and temporally grouped data
models. The temporally grouped data model is also referred to as the
non-first-normal-form model or attribute time stamping, in which the domain of each
attribute is extended to include the temporal dimension [8], e.g., Gadia's temporal
data model [22]. It is shown that the temporally grouped representation has more
expressive power and is more natural since it is history-oriented [9]. TSQL2 [23]
tries to reconcile the two approaches [9] within the severe limitations of the
relational tables. Our approach is based on a temporally grouped data model,
which dovetails perfectly with the hierarchical structure of XML documents.

The lack of temporal support in commercial DBMSs can be attributed to the
limitations of SQL, the engineering complexity, and the difficulty of implementing
it incrementally [24].

Publishing Relational Databases in XML. There is much current interest 
in publishing relational databases in XML. A middleware-based approach is used 
in SilkRoute [25] and XPERANTO [6]. For instance, XPERANTO can build a 
default view on the whole relational database, and new XML views and queries 
upon XML views can then be defined using XQuery. XQuery statements are 
then translated into SQL and executed on the RDBMS engine. SQL/XML is 
emerging as a new SQL standard supported by several DBMS vendors [4, 26], to 
extend RDBMS with XML support. 

Time in XML. Some interesting research work has recently focused on the
problem of representing historical information in XML. In [27] an
annotation-based object model is proposed to manage historical semistructured data, and
a special Chorel language is used to query changes. In [28] a new <valid>
markup tag for XML/HTML documents is proposed to support valid time on
the Web, so that temporal visualization can be implemented on web browsers with
XSL. In [29], a dimension-based method is proposed to manage changes in XML
documents; however, how to support queries is not discussed.

In [30], a data model is proposed for temporal XML documents. However,
since a valid interval is represented as a mixed string, queries have to be
supported by extending DOM APIs or XPath. Similarly, in [31,32], extensions of
XPath are needed to support temporal semantics. (In our approach, we instead
support XPath/XQuery without any extension to XML data models or query
languages.) A τXQuery language is proposed in [33] to extend XQuery for
temporal support, which has to provide new constructs for the language.

An archiving technique for scientific data using XML was presented in [34] , 
but the issue of temporal queries was not discussed. Both the schema proposed 
in [34] and our schema are generalizations of SCCS [35]. 




Fig. 1. TEER Schema of Employees and Departments (with Time Semantics Added)

3 An Example

The Temporal EER Model (TEER) [16] extends the temporal semantics into the
existing EER modeling constructs, and works for both valid time and transaction
time. The TEER model associates each entity with a lifespan; an attribute's value
history is grouped together and assigned a temporal element (a union
of valid temporal spans). Each relationship instance is also associated with a
temporal element to represent its lifespan.

The authors of [16] believe this temporal ER model to be a more natural way
to manage temporal aspects of data than a tuple-oriented relational data
model. Suppose that we have two relations employees and departments;
each employee has a name, title, salary, and dept (name is the key), and
each dept has a name and manager (name is the key). To model the history of
the two relations, we use a TEER diagram as shown in Figure 1. (For simplicity,
only valid time is considered; transaction time can be modeled in a similar
way.) Figure 1 looks exactly like a normal ER diagram except that the time
semantics is added.

In this schema, the entity employee (or e) will have the following temporal
attribute values:

SURROGATE(e) = { [1995-01-01, now] -> surrogate_id }
NAME(e)      = { [1995-01-01, now] -> Bob }
TITLE(e)     = { [1995-01-01, 1997-12-31] -> Engineer,
                 [1998-01-01, now] -> Sr Engineer }
SALARY(e)    = { [1995-01-01, 1997-12-31] -> 65000,
                 [1998-01-01, 1999-12-31] -> 70000,
                 [2000-01-01, now] -> 85000 }

Here each attribute value is associated with a valid time lifespan. The surrogate
is a system-defined identifier, which can be ignored if the key doesn't change.

The following is the list of temporal attribute values of entity dept (or d):

SURROGATE(d) = { [1995-01-01, now] -> surrogate_id }
NAME(d)      = { [1995-01-01, now] -> RD }

Similarly, for the instance rb of the relationship belongs_to between
employee 'Bob' and dept 'RD', the lifespan is T(rb) = [1995-01-01, now], and for
the instance rm of the relationship manages between employee 'Mike' and dept
'RD', the lifespan is T(rm) = [1999-01-01, now].

In the next section, we show that such a temporal ER model can be supported
well with XML.
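
To make the grouped attribute histories above concrete, the following Python sketch stores each attribute of employee e as a list of (start, end, value) periods and reads off the state on a given day. It is an illustration only: the `NOW` stand-in (the largest representable date), the dictionary layout and the helper names are assumptions, not part of TEER.

```python
from datetime import date

NOW = date.max  # assumed stand-in for the moving "now" timestamp (9999-12-31)


def d(text):
    """Parse an ISO date such as '1995-01-01'."""
    return date.fromisoformat(text)


# Temporal attribute values of entity e, copied from the listing above.
employee = {
    "NAME": [(d("1995-01-01"), NOW, "Bob")],
    "TITLE": [(d("1995-01-01"), d("1997-12-31"), "Engineer"),
              (d("1998-01-01"), NOW, "Sr Engineer")],
    "SALARY": [(d("1995-01-01"), d("1997-12-31"), 65000),
               (d("1998-01-01"), d("1999-12-31"), 70000),
               (d("2000-01-01"), NOW, 85000)],
}


def value_at(history, day):
    """Value whose inclusive period [start, end] contains day, else None."""
    for start, end, value in history:
        if start <= day <= end:
            return value
    return None


def snapshot(entity, day):
    """State of every attribute of the entity on the given day."""
    return {attr: value_at(history, day) for attr, history in entity.items()}


print(snapshot(employee, d("1999-05-01")))
```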



4 Valid Time History in XML 

While transaction time identifies when data was recorded in the database, valid
time concerns when a fact was true in reality. One major difference is that
while transaction time is append-only and cannot be updated, valid time can
be updated by users. We show that, with XML, we can model the valid time
history naturally.

Name  Title        Dept  Salary  Start       End
Bob   Engineer     RD    65000   1995-01-01  1997-12-31
Bob   Sr Engineer  RD    70000   1998-01-01  1999-12-31
Bob   Sr Engineer  RD    85000   2000-01-01  now

Fig. 2. Valid Time History of Employees

<employees vstart="1995-01-01" vend="now">
  <employee vstart="1995-01-01" vend="now">
    <name vstart="1995-01-01" vend="now">Bob</name>
    <title vstart="1995-01-01" vend="1997-12-31">Engineer</title>
    <title vstart="1998-01-01" vend="now">Sr Engineer</title>
    <dept vstart="1995-01-01" vend="now">RD</dept>
    <salary vstart="1995-01-01" vend="1997-12-31">65000</salary>
    <salary vstart="1998-01-01" vend="1999-12-31">70000</salary>
    <salary vstart="2000-01-01" vend="now">85000</salary>
  </employee>
</employees>

Fig. 3. XML Representation of the Valid-time History of Employees (VH-document)

Figure 2 shows a valid time history of employees, where each tuple is
timestamped with a valid time interval. This representation assumes valid time
homogeneity, and is temporally ungrouped [9]. It has several drawbacks: first,
redundant information is stored across tuples, e.g., Bob's department never
changed but is repeated in every tuple; second, temporal queries frequently need
to coalesce tuples, which is a source of complications in temporal query
languages.
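
The coalescing burden can be seen in a short Python sketch that turns the ungrouped tuples of Figure 2 into per-attribute histories. This is an illustrative sketch, not the system's implementation: the row layout, the helper names and the use of "9999-12-31" for now are assumptions, and value-equal periods are merged when they overlap or are adjacent by one day.

```python
from datetime import date

# Tuples of Fig. 2: name, title, dept, salary, start, end ("9999-12-31" = now).
ROWS = [
    ("Bob", "Engineer", "RD", 65000, "1995-01-01", "1997-12-31"),
    ("Bob", "Sr Engineer", "RD", 70000, "1998-01-01", "1999-12-31"),
    ("Bob", "Sr Engineer", "RD", 85000, "2000-01-01", "9999-12-31"),
]


def coalesce(periods):
    """Merge value-equal periods that overlap or are adjacent (inclusive dates)."""
    merged = []
    for value, start, end in sorted(periods, key=lambda p: p[1]):
        last = merged[-1] if merged else None
        if last and last[0] == value and (start - last[2]).days <= 1:
            merged[-1] = (value, last[1], max(last[2], end))
        else:
            merged.append((value, start, end))
    return merged


def group_history(rows):
    """Group the timestamped history of each attribute under the attribute."""
    cols = ("title", "dept", "salary")
    history = {}
    for name, title, dept, salary, start, end in rows:
        s, e = date.fromisoformat(start), date.fromisoformat(end)
        per_attr = history.setdefault(name, {c: [] for c in cols})
        for col, value in zip(cols, (title, dept, salary)):
            per_attr[col].append((value, s, e))
    return {n: {c: coalesce(p) for c, p in attrs.items()}
            for n, attrs in history.items()}


grouped = group_history(ROWS)
print(grouped["Bob"]["dept"])
```

After grouping, Bob's department is stored once with a single period, and his title history has two periods instead of three tuples.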

These problems can be overcome using a representation where the times- 
tamped history of each attribute is grouped under the attribute [9]. This pro- 
duces a hierarchical organization that can be naturally represented by the hi- 
erarchical XML view shown in Figure 3 (VH-document). Observe that every 
element is timestamped using two XML attributes vstart and vend. 

In the VH-document, each element is timestamped with an inclusive valid
time interval (vstart, vend). vend can be set to now to denote the ever-increasing
current date, which is internally represented as "9999-12-31" (Section 4.2). Note
that an entity (e.g., employee 'Bob') always has a lifespan at least as long as that
of its children; thus there is a valid time covering constraint: the valid
time interval of a parent node always covers those of its child nodes. This
constraint is preserved in the update process (Section 4.3).
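
The covering constraint can be checked mechanically. The sketch below is a minimal illustration using Python's standard `xml.etree.ElementTree`; the function name and the abbreviated sample document are our own, and now is written in its internal form "9999-12-31" so that ISO date strings compare correctly as text.

```python
import xml.etree.ElementTree as ET

# Abbreviated VH-document (assumed sample), with now stored as 9999-12-31.
VH_DOC = """
<employees vstart="1995-01-01" vend="9999-12-31">
  <employee vstart="1995-01-01" vend="9999-12-31">
    <name vstart="1995-01-01" vend="9999-12-31">Bob</name>
    <title vstart="1995-01-01" vend="1997-12-31">Engineer</title>
    <title vstart="1998-01-01" vend="9999-12-31">Sr Engineer</title>
  </employee>
</employees>
"""


def covering_violations(elem, path=""):
    """Return paths of children whose interval is not covered by the parent's."""
    here = f"{path}/{elem.tag}"
    bad = []
    for child in elem:
        covered = (elem.get("vstart") <= child.get("vstart")
                   and child.get("vend") <= elem.get("vend"))
        if not covered:
            bad.append(f"{here}/{child.tag}")
        bad.extend(covering_violations(child, here))   # check deeper levels too
    return bad


root = ET.fromstring(VH_DOC)
print(covering_violations(root))
```

An update function could run such a check after each modification and widen the parent's interval when a violation is reported.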

Unlike the relational data model that is almost invariably depicted via tables,
XML is not directly associated with a graphical representation. This creates the
challenge and the opportunity of devising the graphical representation most
conducive for the application at hand, and implementing it using standard XML
tools such as XSL [36]. Figure 4 shows a representation of temporally grouped
tables that we found effective as a user interface (and even more so after adding
contrasting colored backgrounds and other browser-supported embellishments).

Fig. 4. Temporally Grouped Valid Time History of Employees

4.1 Valid Time Temporal Queries 

The data shown in Figure 4 is the actual data stored in the database, with the
exception of the special "now" symbol discussed later. Thus a powerful query
language such as XQuery can be directly applied to this data model. In terms
of data types, XML and XQuery support an adequate set of built-in temporal
types, including dateTime, date, time, and duration [5]; they also provide a
complete set of comparison and casting functions for duration, date and time values,
making snapshot and period-based queries convenient to express in XQuery.
Furthermore, whenever more complex temporal functions are needed, they can be
defined using XQuery functions that provide a native extensibility mechanism
for the language.

Next we show that we can specify temporal queries with XQuery on the 
VH-document, such as temporal projection, snapshot queries, temporal slicing, 
temporal joins, etc. 

Query V1: Temporal projection: retrieve the history of departments where Bob
was employed:

<dept>{
for $s in doc("emps.xml")/employees/employee[name="Bob"]/dept
return $s
}</dept>

Query V2: Snapshot: retrieve the managers of each department on 1999-05-01:

for $m in doc("depts.xml")/depts/dept/mgrno
          [vstart(.)<="1999-05-01" and vend(.)>="1999-05-01"]
return $m

Here depts.xml is the VH-document that includes the history of dept names
and managers. vstart() and vend() are user-defined functions (expressed in
XQuery) that return the starting date and ending date of an element's valid
time respectively, thus the implementation is transparent to users.
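
For readers who want to try the snapshot idea outside an XQuery engine, the same selection can be sketched in Python with `xml.etree.ElementTree`. The depts document below is hypothetical (the manager numbers 1001 and 1002 are invented for illustration), and `valid_on` and `managers_on` are our own helper names, not functions from the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical depts VH-document; manager numbers are made up for illustration.
DEPTS = """
<depts vstart="1995-01-01" vend="9999-12-31">
  <dept vstart="1995-01-01" vend="9999-12-31">
    <name vstart="1995-01-01" vend="9999-12-31">RD</name>
    <mgrno vstart="1995-01-01" vend="1998-12-31">1001</mgrno>
    <mgrno vstart="1999-01-01" vend="9999-12-31">1002</mgrno>
  </dept>
</depts>
"""


def valid_on(elem, day):
    """True if the element's inclusive interval [vstart, vend] contains day.
    ISO dates compare correctly as strings, so no parsing is needed."""
    return elem.get("vstart") <= day <= elem.get("vend")


def managers_on(root, day):
    """Snapshot of /depts/dept/mgrno on the given day (cf. Query V2)."""
    return [m.text for m in root.findall("dept/mgrno") if valid_on(m, day)]


depts_root = ET.fromstring(DEPTS)
print(managers_on(depts_root, "1999-05-01"))
```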

Query V3: Continuous Period: find employees who worked as a manager for
more than 5 consecutive years (i.e., 1826 days):

for $e in doc("emps.xml")/employees/employee[title="Manager"]
for $t in $e/title[.="Manager"]
let $duration := subtract-dates(vend($t), vstart($t))
where dayTimeDuration-greater-than($duration, "P1826D")
return $e/name

Here "P1826D" is a duration constant of 1826 days in XQuery.
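
The duration test in Query V3 amounts to simple date arithmetic. The sketch below restates it in Python; the title periods and the evaluation date are hypothetical, and `span_days` and `held_longer_than` are illustrative names. Open-ended periods are closed at the evaluation date, mirroring how vend() resolves now.

```python
from datetime import date

END_OF_TIME = date(9999, 12, 31)   # internal encoding of "now"


def span_days(start, end, today):
    """Length of a valid-time period in days; open periods end at today."""
    effective_end = today if end == END_OF_TIME else end
    return (effective_end - start).days


def held_longer_than(periods, title, min_days, today):
    """True if some single period with the given title exceeds min_days."""
    return any(value == title and span_days(start, end, today) > min_days
               for value, start, end in periods)


# Hypothetical title history and evaluation date.
today = date(2004, 11, 8)
titles = [("Engineer", date(1995, 1, 1), date(1998, 12, 31)),
          ("Manager", date(1999, 1, 1), END_OF_TIME)]

print(span_days(date(1999, 1, 1), date(2004, 1, 1), today))   # five calendar years
print(held_longer_than(titles, "Manager", 1826, today))
```

The five years from 1999-01-01 to 2004-01-01 contain one leap day, which is why the threshold is 1826 rather than 1825 days.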

Query V4: Temporal Join: find employees who were making the same salaries
on 2001-04-01:

for $e1 in doc("emps.xml")/employees/employee
for $e2 in doc("emps.xml")/employees/employee
where $e1/salary[vstart(.)<='2001-04-01' and vend(.)>='2001-04-01'] =
      $e2/salary[vstart(.)<='2001-04-01' and vend(.)>='2001-04-01']
  and $e1/name != $e2/name
return ($e1/name, $e2/name)

This query will join emps.xml with itself. It is also easy to support the since and
until connectives of first-order temporal logic [18], for example:

Query V5: A Until B: find the employee who was hired and worked in dept
"RD" until Bob was appointed as the manager of the dept:

for $e in doc("emps.xml")/employees/employee
for $b in doc("emps.xml")/employees/employee[name='Bob']
let $t := $b/title[.='manager']
let $bd := $b/dept[.='RD']
let $d := $e/dept[1][.='RD']
where vmeets($d, $t) and vcontains($bd, $t)
return <employee>{$e/name}</employee>
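
The interval predicates used in these queries can be sketched as plain functions over inclusive date intervals. Only vcontains follows a definition given in the paper; the semantics chosen here for the other predicates (in particular, that vmeets means the second interval starts the day after the first ends) are assumptions made for illustration.

```python
from datetime import date
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    vstart: date
    vend: date      # inclusive end


def vcontains(a: Interval, b: Interval) -> bool:
    """a covers b (the definition given for vcontains in the paper)."""
    return a.vstart <= b.vstart and a.vend >= b.vend


def voverlaps(a: Interval, b: Interval) -> bool:
    """a and b share at least one day (assumed semantics)."""
    return a.vstart <= b.vend and b.vstart <= a.vend


def vprecedes(a: Interval, b: Interval) -> bool:
    """a ends strictly before b starts (assumed semantics)."""
    return a.vend < b.vstart


def vmeets(a: Interval, b: Interval) -> bool:
    """b starts on the day after a ends (assumed semantics)."""
    return (b.vstart - a.vend).days == 1


def voverlapinterval(a: Interval, b: Interval) -> Optional[Interval]:
    """The common sub-interval of a and b, or None if they are disjoint."""
    if not voverlaps(a, b):
        return None
    return Interval(max(a.vstart, b.vstart), min(a.vend, b.vend))


engineer = Interval(date(1995, 1, 1), date(1997, 12, 31))
senior = Interval(date(1998, 1, 1), date(9999, 12, 31))
lifespan = Interval(date(1995, 1, 1), date(9999, 12, 31))

print(vmeets(engineer, senior), vcontains(lifespan, engineer))
```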



4.2 Temporal Operators 

In the temporal queries, we used functions such as vstart and vend to shield 
users from the implementations of representing time. Functions predefined in- 
clude: timestamp referencing functions, such as vstart, vend; interval compari- 
son functions, such as voverlaps, vprecedes, vcontains, veguals, vmeets, 
voverlapinterval: and during and date/time functions, such as vtimespan, 
vinterval. For example, vcontains is defined as follows: 

define function vcontains ($a, $b) { 
if ($a/@vstart<= $b/@vstart and $a/@vend >= $b/@vend ) 
then trued 
else falseO 

} 

Internally, we use “end-of-time” values to denote the ‘now’ and ‘UC’ symbols.
For instance, for dates we use “9999-12-31”. The user does not access this value
directly, but accesses it through built-in functions. For instance, to refer to the
ending valid time of a node s, the user calls the function vend(s), which returns
s’s end if this is different from “9999-12-31”, and CURRENT_DATE otherwise.
Nodes returned in the output normally use the “9999-12-31” representation
used for internal data. However, for data returned to the end-user, two different
representations are preferable. One is to return the current date by applying the
function rvend(), which recursively replaces all occurrences of “9999-12-31”
with the value of CURRENT_DATE. The other is to return a special string, such
as now, to be displayed on the end-user screen.
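
The end-of-time convention can be illustrated with a minimal Python sketch. It is
not the paper's XQuery implementation: nodes are modeled as plain dictionaries,
and the parameters today and as_label are assumptions introduced only for this
illustration.

```python
from datetime import date

END_OF_TIME = "9999-12-31"  # internal stand-in for 'now' / 'UC'


def vend(node, today=None):
    """Return the node's valid-time end, or the current date if it is open-ended."""
    today = today or date.today().isoformat()
    end = node["vend"]
    return today if end == END_OF_TIME else end


def rvend(node, today=None, as_label=False):
    """Return a copy of the node tree with every end-of-time value replaced.

    With as_label=False the current date is substituted; with as_label=True
    the display string 'now' is used instead (the second option in the text).
    """
    today = today or date.today().isoformat()
    shown = "now" if as_label else today
    out = dict(node)
    if out.get("vend") == END_OF_TIME:
        out["vend"] = shown
    out["children"] = [rvend(c, today, as_label) for c in node.get("children", [])]
    return out
```

The stored data keeps the “9999-12-31” value in both cases; only the copy handed
to the end-user is rewritten.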

These valid-time queries are similar to those on transaction-time history, as
discussed in [13]. However, unlike transaction-time databases, valid-time databases
must also support explicit updates. This is not discussed in [13] and will be
discussed next.

4.3 Database Modifications 

An update task force is currently working on defining standard update constructs 
for XQuery [37]; moreover, update constructs are already supported in several
native XML databases [38]. Our approach to temporal updates consists in supporting
the operations of insert, delete, and update via user-defined functions.
This approach will preserve the validity of end-user programs in the face of dif- 
ferences between vendors and evolving standards. It also shields the end-users 
from the complexity of the additional operations required by temporal updates, 
such as the coalescing of periods, and the propagation of updates to enforce the 
covering constraints. 

INSERT. When a new entity is inserted, the new employee element with its
child elements is appended to the VH-document; the vstart attributes are
set to the valid starting timestamp, and the vend attributes are set to now.
Insertion can be done through the user-defined function
VInsert($path, $newelement). The new element can be created using the
function VNewElement($valueset, $vstart, $vend).

For example, the following query inserts Mike as an engineer into RD dept 
with salary 50K, starting immediately: 

for $s in doc("emps.xml")/employees/employee[last()]
return VInsert($s, VNewElement(
    ["Mike", "Engineer", "RD", "50000"], current-date(), "now"))

DELETE. There are two types of deletion: deletion without valid time and
deletion with valid time. The former assumes a default valid time interval
(current-date, forever), and can be implemented with the user-defined function
VNodeDelete($path). For deletion with a valid time interval v on node e, there
can be three mutually exclusive cases: (i) e is removed if its valid time interval
is contained in v, (ii) the valid time interval of e is shortened if the two intervals
overlap but do not contain each other, or (iii) e’s interval is split if it properly
contains v. Deletions on a node are then propagated downward to its children
to satisfy the covering constraint. Node deletion (with downward propagation)
is supported by the function VTimeDelete($path, $vstart, $vend).
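
The three deletion cases can be sketched in Python as follows. This is an
illustration only, under two assumptions: intervals are inclusive pairs of dates,
and in case (ii) the element keeps the part of its interval that v does not
cover. The function name delete_valid_time is hypothetical, and downward
propagation to child nodes is left out.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def delete_valid_time(e, v):
    """Remove period v from element interval e (inclusive (start, end) date pairs).

    Returns the list of intervals that remain valid for the element.
    """
    e_start, e_end = e
    v_start, v_end = v
    if v_end < e_start or v_start > e_end:
        return [e]  # disjoint intervals: nothing to delete
    if v_start <= e_start and e_end <= v_end:
        return []  # case (i): e lies inside v, so e is removed
    if e_start < v_start and v_end < e_end:
        # case (iii): e properly contains v, so e is split in two
        return [(e_start, v_start - ONE_DAY), (v_end + ONE_DAY, e_end)]
    # case (ii): partial overlap, so one side of e is trimmed
    if v_start <= e_start:
        return [(v_end + ONE_DAY, e_end)]
    return [(e_start, v_start - ONE_DAY)]
```

A full implementation would apply the same function to each child of e so that
the covering constraint continues to hold.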

UPDATE. Updates can be on values or on valid time, and coalescing is needed.
Two functions are defined: VNodeReplace($path, $newValue) and
VTimeReplace($path, $vstart, $vend). For a value update, propagation is not needed;
for a valid time update, the change must be propagated downward to the valid
times of the node’s children. If a valid time update on a child node violates the
valid time covering constraint, then the update will fail.

5 Viewing Transaction Time History as XML 

In [13] we have proposed an approach to represent the transaction-time his- 
tory of relational databases in XML using a temporally grouped data model. 
This approach is very effective at supporting complex temporal queries using 
XQuery [5], without requiring changes in this standard query language. 

In [13] we used these features to show that the XML-viewed transaction 
time history (TH-document) can be easily generated from the evolving history of
the databases, and implemented by either using native XML databases or, after 
decomposition into binary relations, by relational databases enhanced with tools 
such as SQL/XML [4]. We also showed that XQuery without modifications can 
be used as an effective language for expressing temporal queries. 

A key issue not addressed in [13] was whether this approach, and its unique 
practical benefits of only requiring off-the-shelf tools, can be extended to support 
bitemporal databases. With two dimensions of time, bitemporal databases have 
much more complexity, e.g., coalescing on two dimensions, explicit update com- 
plexity, and support of more complex bitemporal queries. In the next section, 
we explore how to support a bitemporal data model based on XML. 

6 An XML-Based Bitemporal Data Model 

6.1 The XBiT Data Model 

In practice, temporal applications often involve both transaction time and valid 
time. We show next that, with XML, we can naturally represent a temporally 
grouped data model, and provide support for complex bitemporal queries. 

Bitemporal Grouping. Figure 5 shows a bitemporal history of employees, us- 
ing a temporally ungrouped representation. Although valid time and transaction 
time are generally independent, for the sake of illustration, we assume here that 
employees’ promotions are scheduled and entered in the database four months 
before they occur. 

XBiT supports a temporally grouped representation by coalescing attributes’ 
histories on both transaction time and valid time. Temporal coalescing on two 

Name  Title        Dept  Salary  Valid Time             Transaction Time
Bob   Engineer     RD    65000   1995-01-01:now         1995-01-01:1997-08-31
Bob   Engineer     RD    65000   1995-01-01:1997-12-31  1997-09-01:UC
Bob   Sr Engineer  RD    70000   1998-01-01:now         1997-09-01:1999-08-31
Bob   Sr Engineer  RD    70000   1998-01-01:1999-12-31  1999-09-01:UC
Bob   Sr Engineer  RD    85000   2000-01-01:now         1999-09-01:UC

Fig. 5. Bitemporal History of Employees



temporal dimensions is different from coalescing on just one. On one dimension, 
coalescing is done when: i) two successive tuples are value equivalent, and ii) 
the intervals overlap or meet. The two intervals are then merged into maximal 
intervals. 

For bitemporal histories, coalescing is done when two tuples are value-equiv- 
alent and (i) their valid time intervals are the same and the transaction time 
intervals meet or overlap; or (ii) the transaction time intervals are the same 
and the valid time intervals meet or overlap. This operation is repeated until no 
tuples satisfy these conditions. 

For example, in Figure 5, to group the history of titles with value ‘Sr Engineer’
in the last three tuples, i.e., (title, valid-time, transaction-time), the last
two transaction time intervals are the same, so they are coalesced as
(Sr Engineer, 1998-01-01:now, 1999-09-01:UC). This one again has the same valid
time interval as the previous one, (Sr Engineer, 1998-01-01:now,
1997-09-01:1999-08-31); thus finally they are coalesced as
(Sr Engineer, 1998-01-01:now, 1997-09-01:UC), as shown in Figure 7.
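
The bitemporal coalescing rule can be sketched in Python as follows. This is an
illustrative sketch rather than the XBiT implementation: tuples are modeled as
(value, valid_interval, transaction_interval) triples with inclusive date
intervals, and the constant FOREVER is an assumption that stands in for both
‘now’ and ‘UC’.

```python
from datetime import date, timedelta

FOREVER = date(9999, 12, 31)  # stands in for both 'now' and 'UC'


def meets_or_overlaps(a, b):
    """True if inclusive intervals a and b overlap or are adjacent by one day."""
    (a_start, a_end), (b_start, b_end) = sorted([a, b])
    if a_end == FOREVER:
        return True  # a is open-ended and starts first, so it reaches b
    return b_start <= a_end + timedelta(days=1)


def merge(a, b):
    """Smallest interval covering both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def coalesce(tuples):
    """Repeatedly merge value-equivalent tuples until no pair qualifies."""
    result = list(tuples)
    changed = True
    while changed:
        changed = False
        for i in range(len(result)):
            for j in range(i + 1, len(result)):
                (v1, val1, tx1), (v2, val2, tx2) = result[i], result[j]
                if v1 != v2:
                    continue
                if val1 == val2 and meets_or_overlaps(tx1, tx2):
                    merged = (v1, val1, merge(tx1, tx2))   # condition (i)
                elif tx1 == tx2 and meets_or_overlaps(val1, val2):
                    merged = (v1, merge(val1, val2), tx1)  # condition (ii)
                else:
                    continue
                result[i] = merged
                del result[j]
                changed = True
                break
            if changed:
                break
    return result
```

Applied to the title history of Figure 5, the three ‘Sr Engineer’ tuples collapse
into one, while the two ‘Engineer’ tuples stay separate because neither their
valid time intervals nor their transaction time intervals coincide.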



Data Modeling of Bitemporal History with XML. With temporal group- 
ing, the bitemporal history is represented in XBiT as an XML document (BH- 
document). This is shown in the example of Figure 6, which is snapshot-equiv- 
alent to the example of Figure 5. Each employee entity is represented as an 
employee element in the BH-document, and table attributes are represented as 
employee element’s child elements. Each element in the BH-document is assigned 
two pairs of attributes tstart and tend to represent the inclusive transaction 
time interval, and vstart and vend to represent the inclusive valid time inter- 
val. Elements corresponding to a table attribute value history are ordered by 
the starting transaction time tstart. The value of tend can be set to UC (until 
changed), and vend can be set to now. There is a covering constraint whereby 
the transaction time interval of a parent node must always cover that of its child 
nodes, and likewise for valid time intervals. 

Figure 7 displays the resulting temporally grouped representation, which is 
appealing to intuition, and also effective at supporting natural language inter- 
faces, as shown by Clifford [7]. 

<employees vstart="1995-01-01" vend="now" tstart="1995-01-01" tend="UC">
 <employee vstart="1995-01-01" vend="now" tstart="1995-01-01" tend="UC">
  <name vstart="1995-01-01" vend="now" tstart="1995-01-01" tend="UC">Bob</name>
  <title vstart="1995-01-01" vend="now" tstart="1995-01-01" tend="1997-08-31">Engineer</title>
  <title vstart="1995-01-01" vend="1997-12-31" tstart="1997-09-01" tend="UC">Engineer</title>
  <title vstart="1998-01-01" vend="now" tstart="1997-09-01" tend="UC">Sr Engineer</title>
  <dept vstart="1995-01-01" vend="now" tstart="1995-01-01" tend="UC">RD</dept>
  <salary vstart="1995-01-01" vend="now" tstart="1995-01-01" tend="1997-08-31">65000</salary>
  <salary vstart="1995-01-01" vend="1997-12-31" tstart="1997-09-01" tend="UC">65000</salary>
  <salary vstart="1998-01-01" vend="now" tstart="1997-09-01" tend="1999-08-31">70000</salary>
  <salary vstart="1998-01-01" vend="1999-12-31" tstart="1999-09-01" tend="UC">70000</salary>
  <salary vstart="2000-01-01" vend="now" tstart="1999-09-01" tend="UC">85000</salary>
 </employee>
</employees>

Fig. 6. XML Representation of the Bitemporal History of Employees (BH-document)



6.2 Bitemporal Queries with XQuery 

The XBiT-based representation can also support powerful temporal queries, ex- 
pressed in XQuery without requiring the introduction of new constructs in the 
language. We next show how to express bitemporal queries on employees. 

Query B1: Temporal projection: retrieve the bitemporal salary history of
employee “Bob”:

<salary_history>
{ for $s in doc("emps.xml")/employees/employee[name="Bob"]/salary
  return $s }
</salary_history>

This query is exactly the same as query V1, except that it retrieves both
transaction time and valid time history of salaries.

Query B2: Snapshot: according to what was known on 1999-05-01, what was 
the average salary at that time? 

let $s := doc("emps.xml")/employees/employee/salary
where tstart($s) <= "1999-05-01" and tend($s) >= "1999-05-01"
  and vstart($s) <= "1999-05-01" and vend($s) >= "1999-05-01"
return avg($s)

Here tstart(), tend(), vstart() and vend() are user-defined functions that
get the starting date and ending date of an element’s transaction time and
valid time, respectively.

Query B3: Diff queries: retrieve employees whose salaries (according to our
current information) did not change between 1999-01-01 and 2000-01-01:

let $s := doc("emps.xml")/employees/employee/salary
where tstart($s) <= current-date() and tend($s) >= current-date()
  and vstart($s) <= "1999-01-01" and vend($s) >= "2000-01-01"
return $s/..

Fig. 7. Temporally Grouped Bitemporal History of Employees



This query will take a transaction time snapshot and a valid time slicing of 
salaries. 

Query B4: Change Detection: find all the updates of employee salaries that 
were applied retroactively. 

for $s in doc("emps.xml")/employees/employee/salary
where tstart($s) > vstart($s) or tend($s) > vend($s)
return $s

Query B5: find the manager for each current employee, as best known now: 

for $e in doc("emps.xml")/employees/employee
for $d in doc("depts.xml")/depts/dept/name[.=$e/dept]
where tend($e) = "UC" and tend($d) = "UC"
  and vend($e) = "now" and vend($d) = "now"
return ($e, $d)

This query will take the current snapshot on both transaction time and valid 
time. 

6.3 Database Modifications 

For valid time databases, both attribute values and attribute valid time can be 
updated by users, and XBiT must perform some implicit coalescing to support 
the update process. Note that only elements that are current (i.e., whose ending
transaction time is UC) can be modified. A modification combines two processes: explicit
modification of valid time and values, and implicit modification of transaction 
time. 



Modifications of Transaction Time Databases. Transaction time modifi- 
cations can also be classified as three types: insert, delete, and update. 

INSERT. When a new tuple is inserted, the corresponding new element (e.g.,
employee ‘Bob’) and its child elements in the BH-document are timestamped with
starting transaction time as current date, and ending transaction time as UC.
The user-defined function TInsert($node) will insert the node with the
transaction time interval (current date, UC).

DELETE. When a tuple is removed, the ending transaction time of the corre- 
sponding element and its current children is changed to current time. This can 
be done by the function TDelete($node).

UPDATE. Update can be seen as a delete followed by an insert. 
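
As a rough illustration of these three operations, the Python sketch below keeps
the history of one attribute as a list of versions. The dictionary layout and the
names t_insert, t_delete and t_update are assumptions made for this example; they
only mirror the roles of TInsert and TDelete described above.

```python
UC = "UC"  # 'until changed'


def t_insert(history, value, today):
    """Append a new current version with transaction time (today, UC)."""
    history.append({"value": value, "tstart": today, "tend": UC})


def t_delete(history, today):
    """Close every current version by setting its ending transaction time."""
    for version in history:
        if version["tend"] == UC:
            version["tend"] = today


def t_update(history, value, today):
    """An update is a delete followed by an insert."""
    t_delete(history, today)
    t_insert(history, value, today)
```

No version is ever removed, so the list retains the complete transaction-time
history of the attribute.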

Database Modifications in XBiT. Modifications in XBiT can be seen as the 
combination of modifications on valid time and transaction time history. XBiT 
will automatically coalesce on both valid time and transaction time. 

INSERT. Insertion is similar to valid time database insertion except that the
added element is timestamped with the transaction time interval (current date,
UC).

This can be done by the function BInsert($path, $newelement), which
combines VInsert and TInsert.

DELETE. Deletion is similar to valid time database deletion, except that
the function TDelete is called to change tend of the deleted element and its
current children to the current date. Node deletion is done through the function
BNodeDelete($path), and valid time deletion is done through the function
BTimeDelete($path, $vstart, $vend).

UPDATE. Update is also a combination of valid time and transaction time updates,
i.e., deleting the old tuple with tend set to the current date, and inserting the new
tuple with the new value and valid time interval, tstart set to the current date and
tend set to UC. This is done by the functions BNodeReplace($path, $newvalue) and
BTimeReplace($path, $vstart, $vend), respectively.



6.4 Temporal Database Implementations 

Two basic approaches are possible to manage the three types of H-documents
discussed here: one is to use a native XML database, and the other is to use
a traditional RDBMS. In [13] we show that a transaction time TH-document can
be stored in an RDBMS and that this has significant performance advantages on
temporal queries over a native XML database. Similarly, an RDBMS-based approach
can be applied to the valid time history and the bitemporal history. First, the
BH-document is shredded and stored into H-tables.

For example, the employee BH-document in Figure 6 is mapped into the 
following attribute history tables: 

employee.name (id, name, vstart, vend, tstart, tend) 
employee.title (id, title, vstart, vend, tstart, tend) 
employee.salary (id, salary, vstart, vend, tstart, tend)

Since the BH-document and H-tables have a simple mapping relationship,
temporal XQuery can be translated into SQL queries based on such mapping
relationship, using the techniques discussed in [13].
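
The shredding step can be sketched in Python with the standard xml.etree module.
This is only an illustration of the mapping, not the system described in [13]:
the function name shred is hypothetical, and the id column is assumed here to be
the position of the employee element in the document.

```python
import xml.etree.ElementTree as ET

TIME_ATTRS = ("vstart", "vend", "tstart", "tend")


def shred(bh_xml):
    """Map each child element of every employee to a row of its H-table.

    Returns a dict from table name (e.g. 'employee.salary') to a list of
    (id, value, vstart, vend, tstart, tend) rows.
    """
    tables = {}
    root = ET.fromstring(bh_xml)
    for emp_id, employee in enumerate(root.findall("employee"), start=1):
        for attr in employee:
            row = (emp_id, attr.text.strip()) + tuple(attr.get(a) for a in TIME_ATTRS)
            tables.setdefault("employee." + attr.tag, []).append(row)
    return tables
```

Each returned list corresponds to one attribute history table, so loading the
rows into an RDBMS is a matter of one bulk insert per table.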

7 Conclusions 

In this paper, we showed that valid-time, transaction-time, and bitemporal 
databases can be naturally managed in XML using temporally-grouped data 
models. This approach is similar to the one we proposed for transaction-time
databases in [13], but we have here shown that it also supports (i) the temporal
EER model [16], and (ii) valid-time and bitemporal databases with the com- 
plex temporal update operations they require. Complex historical queries, and 
updates, which would be very difficult to express in SQL on relational tables, 
can now be easily expressed in XQuery on such XML-based representations. 

The technique is general and can be applied to historical representations of 
relational data, XML documents in native XML databases, and version manage- 
ment in archives and web warehouses [12]. It can also be used to support schema 
evolution queries [39]. 

The area of business operations monitoring and management is rapidly gaining
importance both in industry and in academia. This is demonstrated by the large
number of performance reporting tools that have been developed. Such tools essen- 
tially leverage system monitoring and data warehousing applications to perform 
online analysis of business operations and produce fancy charts, from which users can 
get the feeling of what is happening in the system. While this provides value, there is 
still a huge gap between what is available today and what users would ideally like to 
have 1 : 

• Business analysts tend to think of the way business operations are performed in
terms of high-level business processes, which we will call abstract in the following.
There is no way today for analysts to draw such abstract processes and use them as
a metaphor for analyzing business operations.

• Defining metrics of interest and reporting against these metrics requires a signifi- 
cant coding effort. No system provides, out of the box, the facility for easily defin- 
ing metrics over process execution data, for providing users with explanations for 
why a metric has a certain value, and for predicting the future value for a metric. 

• There is no automated support for identifying optimal configurations of the busi- 
ness processes to improve critical metrics. 

• There is no support for understanding the business impact of system failures. 

The Enterprise Cockpit (EC) is an "intelligent" business operation management 
platform that provides the functionality described above. In addition to providing 
information and alerts about any business operation supported by an IT infrastructure, 
EC includes control and optimization features, so that managers can use it to auto- 
matically or manually intervene on the enterprise processes and resources, make 
changes in response to problems, or identify optimizations that can improve business- 
relevant metrics. In the following, we sketch the proposed solution 2 . 

The basic layer of EC is the Abstract Process Monitor (APM), that allows users to 
define abstract processes and link the steps in these processes with events (e.g., access 
to certain Web pages, invocation of SAP interface methods, etc.) occurring in the 
underlying IT infrastructure. In addition to monitoring abstract processes, EC
leverages other business operation data, managed by means of "traditional" data
warehousing techniques, and therefore not discussed further here.

Once processes have been defined, users can specify metrics or SLAs over them, 
through the metric/SLA definer. For example, analysts can define a success metric 



1 We name here just a few of the many issues that came out at a requirements gathering work- 
shop held last fall in Palo Alto. 

2 A more detailed paper is available on request. 
stating that a payment process is successful if it ends at the "pay invoice" node and is
completed within 4 days from the time the invoice has been received. Metrics are 
defined by means of a simple web-based GUI and by reusing metric templates either 
built into APM or developed by consultants at solution deployment time. Once met- 
rics have been defined, the metric computation engine takes care of computing their 
values. In addition, EC computes distributions for both process attributes (such as the 
duration of each step and of the whole process, or the process arrival rate) and met- 
rics. This is done by the curve fitting module. For example, users can discover that the 
duration of the check invoice step follows a normal distribution with a given mean 
and variance, or that the process arrival rate follows an exponential distribution. 
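
The success metric from the example above can be written down in a few lines of
Python. This is a sketch for illustration only, not the EC metric definer: the
function names and the tuple layout of a process instance are assumptions.

```python
from datetime import datetime, timedelta


def payment_succeeded(final_node, invoice_received, completed,
                      limit=timedelta(days=4)):
    """Success metric: the process ended at 'pay invoice' within the time limit."""
    return final_node == "pay invoice" and (completed - invoice_received) <= limit


def success_rate(instances):
    """Fraction of process instances for which the success metric holds."""
    if not instances:
        return 0.0
    hits = sum(payment_succeeded(*inst) for inst in instances)
    return hits / len(instances)
```

Each instance is a (final_node, invoice_received, completed) triple, so the same
data can feed both the per-instance metric and the aggregate rate.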

EC also provides features to help users make the most out of this information and 
really understand which things go wrong, why, what is their impact on the business, 
and how to correct problems. One of these features is process analysis, performed by 
the analysis and prediction engine. This consists in providing users with explanation 
for why metrics have certain values (e.g., why the cost is high or the success rate is 
low). To this end, EC integrates algorithms that automatically mine the EC databases 
and extract decision trees, which have a graphical formalism that makes it easy, even 
for business users, to examine correlations between metric values and other process 
attributes or metrics and identify the critical attributes affecting metric deviations 
from desired values. For example, users can see that unsuccessful processes are often 
characterized by invoices from a certain supplier arriving on a certain day. The hard 
challenge here is how to prepare the data (collected among the ocean of information 
available by the different data logs) to be fed to the mining algorithm, how to do this 
in an automated fashion (without human supervision), and in a way that works for 
every process and every metric. We addressed this challenge by confining the
problem (we do analysis and prediction over metric data defined over abstract processes),
by leveraging the fact that we have a rich, self-describing process and metric
meta-model and therefore could write data preparation programs that can gather all the
potentially useful process and metric data, and by leveraging experimental knowledge
about which process features are most typically correlated with metric values.

Another feature, essentially based on the same mining technology, is to provide users
with a prediction of the value that a metric will have at the end of a process, or whether
an SLA will be violated or not. Predictions are made at the start of the process and are
updated as process execution proceeds. To this end, a family of decision (or regression)
trees is built for each abstract process. In addition to the predicted value, users are
provided with a confidence value that indicates the probability that the prediction
will happen.

Metric analysis and predictions are useful tools in their own right, but they leave
the burden of optimization to the users. Hence, EC also includes an optimization 
component, that suggests improvements to the enterprise process based on business 
goals, expressed in terms of desired metric values defined over abstract processes. 
This is achieved by leveraging process simulation techniques: Users can state that 
they want to optimize an abstract process so as to minimize or maximize the value of a
certain metric. EC will then simulate the execution of several alternative process con- 
figurations corresponding to that abstract process (for example, will try to allocate 
human and automated resources in different ways, while meeting resource constraints 
defined by the user), will compute metrics out of the simulated data, and will conse- 
quently identify the configuration that best meets the user's goals. EC also optimizes 
the search among the many possible process configurations, although in the current 
version we use simple heuristics for this purpose. 

Finally, we stress that all of the above features are provided in a fully automated 
fashion, at the click of the mouse. This is in sharp contrast with the way that, for ex- 
ample, data mining or process simulation packages are used today, requiring heavy 
manual intervention and lengthy consulting efforts. 

The catalog function is an essential feature in B2C and B2B e-commerce. While the
catalog is primarily for end users to navigate and search for products of interest,
other e-commerce functions such as merchandising, order, inventory and
aftermarket constantly refer to information stored in the catalog [1]. The billion- 
dollar mail order business was created around catalog long before e-commerce. 
More opportunities surface after catalog content previously created on paper is 
digitized. While catalog is recognized as a necessity for a successful web store, 
its content structure varies greatly across industries and also within each indus- 
try. Product categories, attributes, measurements, languages, and currency all 
contribute to the wide variations, which create a difficult dilemma for catalog 
designers. 

We have recently encountered a real business scenario that challenges tradi- 
tional approaches of modeling and building e-catalog. We were commissioned to 
build an in-store shopping solution for branches in retail store chains. The local 
catalog at a branch is a synchronized copy of selected enterprise catalog content 
plus branch specific information, such as item location on the shelf. A key busi- 
ness requirement, which drives up the technical challenge, is that the in-store 
catalog solution needs to interoperate with the retail chain’s legacy enterprise 
catalog or its catalog software vendor of choice. This requirement reflects the 
business reality that decisions to pick enterprise software and branch software 
are usually not made simultaneously nor coordinated. As we learned that hundreds
of enterprise catalog software packages, legacy and recent, are being used in
industries such as grocery, clothing, books, office staples and home improvement,
our challenge is to create a catalog model that autonomously adapts to the content
of the enterprise catalog in any of these industries.

A straightforward answer to the challenge is to build a mapping tool that 
will convert enterprise catalog content to the pre-designed in-store catalog, but 
this approach is highly undesirable. The difficulty is that it is impossible
to predict the content to be stored. A simple example to illustrate the
difficulty is by looking at what is stored in catalog for Home Depot, a home fur- 
nishing retailer, and by examining what is stored in catalog for Staples, an office 
equipment retailer. A kitchen faucet sold at Home Depot has information about 
its size, weight, material, color, and style. On the other hand, a fax machine 
sold at Staples carries attributes such as speed, resolution, and tone dialing. 
These attributes need to be stored in the catalog for retrieval and product com- 
parisons. Without knowing where a catalog will be used, our design obviously 
cannot pre-set the storage schema for either faucets or fax machines. Needless
to say, there are hundreds of thousands of products whose information needs to
be stored in catalogs. Today’s catalog solutions in the market also suffer from
over-design, which leads to wasted storage space, over-normalized schemas and
poor performance. Multi-language support, currency locale, geographical labels and
access control are commonly embedded in, and inseparable from, the main catalog
functions. Suppose a company only operates stores in California. These additional
features then turn from highlights into burdens.

Further compounding the shortfall of the traditional catalog modeling and mapping
approach is the lack of configurability and optimization. A customization
made as a small delta change to the catalog data model propagates, in a magnified
way, all the way up to the business logic and presentation layers. Furthermore, the
vertical schema that stores catalog attributes as name-value pairs distorts database
statistics and makes catalog queries hard to optimize [3][4]. We foresee no easy
way to continue the traditional methodology for a satisfactory solution to our
problem.

In this paper, we propose a set of abstracted catalog semantics to model an 
autonomous catalog to become the in-store catalog solution. An autonomous cat- 
alog exhibits two key properties of autonomic computing: self-configuration and 
self-optimization [2]. It receives definitions of catalog entities from enterprise
catalog to synthesize and create persistent storage schema and programming 
access interface. It buffers objects for cached retrieval and learns from search 
history to create index for performance. The use of autonomous catalog requires 
little learning and training since it morphs into enterprise catalog content struc- 
ture. Changes can be reflected instantly at storage schema and programmatic 
interfaces. 

We model this autonomous catalog by associations of basic categorical en- 
tities. A categorical entity is defined as a named grouping of products that 
share similar attributes. Instances of a categorical entity are physical, procur- 
able products or services. For example, the kitchen faucet may be declared as a 
categorical entity and one of its instances is Moen Asceri. A categorical entity 
may be pointing to one or more categorical entities to establish parent or child 
category relationship. Attributes in a categorical entity may be completely dif- 
ferent from those in another and yet in both cases, they are efficiently stored in 
a normalized schema without applying the vertical schema. 

We define five operations including add, update, delete, search and retrieve 
on categorical entity. To shield software developers from accessing instances of 
categorical entities directly, these five catalog operations can only be executed 
through a programming language interface such as Java. When a new entity is 
declared by the enterprise catalog in XML Schema expression, new Java classes 
and interfaces, following a predefined template of these five operations, will be 
automatically synthesized. 

For example, the enterprise may declare an entity named ‘Kitchen Faucet’ 
with five attributes. Our autonomous catalog then creates tables in the database 
to store instances of faucets and synthesizes a Java class with methods to populate,
retrieve and search the instances by attribute values. Kitchen faucet may be
associated with plumbing and kitchen categories. The Java class has methods to 
support searches from the associated categories. Revisiting the aforementioned 
catalog features such as multi-language support, we can easily add new attributes 
describing kitchen faucet in foreign languages applicable to use cases. There is 
no unused space for catalog attributes not needed. 

Another advantage of the autonomous catalog is its ability to capture more 
sophisticated modeling semantics at runtime, due to the flexibility of program- 
ming language wrapper. For example, in the synthesized Java class, program- 
matic pointers can reference an external taxonomy or ontology for runtime infer- 
encing. Catalog content linked to a knowledge management system can support 
more intelligent queries such as ’which kitchen faucets are recommended for 
water conservation?’ This further brings catalog modeling beyond the inclusive 
entity-relationship diagram. 

The model of the autonomous catalog enables it to reconfigure itself while 
administrators and programmers are shielded from the details of managing 
the flexible persistent storage. As the Java classes change and evolve to 
adapt to the enterprise catalog content, one can envision the business logic 
that invokes these Java classes being modeled and generated autonomously as 
well. We are investigating the modeling of merchandising and order tracking to 
demonstrate the feasibility of autonomous modeling of business logic. 

Sequencing genomes is a fundamental aspect of biological research. Shotgun 
sequencing, since its introduction by Sanger et al. [2], has remained the 
mainstay of genome sequence assembly. This method randomly obtains sequence 
reads (e.g. subsequences of about 500 characters) from a genome and then 
assembles them into contigs based on significant overlap among them. 
The whole-genome shotgun (WGS) approach generates sequence reads directly 
from a whole-genome library and uses computational techniques to reassemble 
them. A variety of assembly programs have been proposed and implemented, 
including PHRAP [3] (Green 1994), CAP3 [4] (1999), and Celera [5] (2000). 
Because of the great computational complexity and the increasingly large data 
sizes, they incur great time and space overhead. PHRAP [3], for instance, which 
can only run stand-alone, requires memory many times (usually more than 10 
times) the size of the original sequence data. In realistic applications, the 
sequencing process may become unacceptably slow for lack of memory, even on a 
mainframe with a large amount of RAM. 

GiSA (Grid System for Genome Sequence Assembly) is designed to solve this 
problem. It is based on Globus Toolkit 3.2. Using a grid framework, it exploits 
parallelism and distribution to improve scalability. Its architecture is shown 
in Figure 1. 

GiSA works as a recursive procedure in which each round consists of two 
steps. The first step partitions the sequence data into several intermediate-sized 
groups in which the sequence reads are related and can potentially be assembled 
together. Each group can be processed independently within limited 
memory. The second step assembles the intermediate results derived from the 
first step of the round. In this way, we can handle very large volumes of 
biological sequence data. 

GiSA is divided into three layers: the client, the PHRAP servers, and the 
management servers, which comprise the BLAST [1] Data Server and the Management 
Data Server (MDS). The client simply sends an assembly request through a Web 
browser. The MDS receives the request, and GiSA then starts the genome 
sequence assembly. 

PHRAP servers are deployed with a grid environment (Globus Toolkit 3.2 for 
Linux in our implementation) and grid services for control and communication. 

* This research is supported in part by the Key Program of National Natural Science 
Foundation of China (No. 69933010 and 60303008), and China National 863 High- 
Tech Projects (No. 2002AA4Z3430 and 2002AA231041). 

Fig. 1. Grid System for Genome Sequence Assembly 



Information about the grid services, such as the GSH (Grid Service Handle), is 
registered in the MDS. Each grid service runs as a single thread to provide 
parallelism. PHRAP is also installed on each PHRAP server to carry out the 
assembly tasks. Each server continuously receives tasks from the Task Queue on 
the MDS and returns its locally processed results. 

On the MDS, several important programs are deployed as threads. The main 
control thread manages all the process variables and schedules the other programs. 
The queue thread constructs and maintains the global Task Queue for workload 
balancing. The dispatching thread dispatches tasks from the Task Queue to the 
PHRAP servers, and the results-receiving thread collects the partial results 
returned from the PHRAP servers. All four threads are carefully designed for 
synchronization. 

Genome sequence data are stored on the BLAST Data Server, where BLAST 
is available for sequence similarity search. 

The whole procedure works as follows. When the client sends an assembly request 
through a Web browser, GiSA starts to run recursively. First, the control 
thread launches the ‘formatdb’ program from the BLAST package to construct the 
BLAST target db. Then, the queue thread randomly selects an unused sequence. 
BLAST uses the sequence as a seed to find sequences that have a promising 
chance of being joined with it. These sequences are collected in a file and 
packed as a Task Element into the Task Queue. The dispatching thread dispatches 
tasks to the PHRAP servers according to their respective capabilities. If there 
is currently no task in the queue, it sleeps for a while. When a Task Element 
is dispatched to a PHRAP server, the server receives the task and runs PHRAP to 
align the sequences. Multiple PHRAP servers work independently and concurrently. 
Local assembly results are generated in plain file format and transferred back 
to the MDS. After the MDS gets the results, it updates the server’s capability 
information for future dispatching decisions and the sequence alignment 
information for use in the next round. When a certain portion of the sequences 
has been processed, the next round starts: a new data source file and BLAST db 
are constructed. This procedure continues recursively. It stops only when no 
more contigs can be generated, and then returns the results to the client. 
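
The round structure described above can be sketched on toy data. This is only 
an illustration under strong assumptions: an exact suffix/prefix match stands in 
for the BLAST similarity search, a greedy merge stands in for PHRAP, a local 
thread pool stands in for the PHRAP servers, and the seed is taken in list order 
rather than at random. The names `overlap`, `partition`, `assemble_group`, 
`assemble` and `MIN_OVERLAP` are hypothetical and are not part of GiSA. 

```python
from concurrent.futures import ThreadPoolExecutor

MIN_OVERLAP = 3  # assumed minimum overlap for two sequences to be joined


def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b (0 if too short)."""
    for k in range(min(len(a), len(b)), MIN_OVERLAP - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def partition(sequences):
    """Stand-in for BLAST seeding: group each seed with the sequences overlapping it."""
    unused = list(sequences)
    groups = []
    while unused:
        seed = unused.pop(0)  # GiSA selects the seed at random
        group, rest = [seed], []
        for s in unused:
            (group if overlap(seed, s) or overlap(s, seed) else rest).append(s)
        unused = rest
        groups.append(group)
    return groups


def assemble_group(group):
    """Stand-in for PHRAP: greedily merge overlapping sequences within one group."""
    contigs = list(group)
    while True:
        pair = next(
            ((i, j) for i in range(len(contigs)) for j in range(len(contigs))
             if i != j and overlap(contigs[i], contigs[j])),
            None,
        )
        if pair is None:
            return contigs
        i, j = pair
        merged = contigs[i] + contigs[j][overlap(contigs[i], contigs[j]):]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


def assemble(reads):
    """Repeat partition-and-assemble rounds until a round produces no new contig."""
    current = list(reads)
    while True:
        groups = partition(current)
        with ThreadPoolExecutor() as pool:  # groups are processed concurrently
            results = list(pool.map(assemble_group, groups))
        contigs = [c for group_result in results for c in group_result]
        if len(contigs) == len(current):
            return contigs
        current = contigs


contigs = assemble(["CAAGGCT", "ACGTTGC", "TTGCAAG"])
```

In this toy input the first round joins only two of the reads, because the 
first seed overlaps just one other read; the second round then joins the 
intermediate contig with the remaining read. 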

Additionally, we designed a Web progress bar in JSP as the user interface 
to visualize the ongoing progress. 

The architecture and workflow of GiSA bring clear benefits. The bottleneck of 
insufficient RAM in a single computer is overcome by partitioning the overall 
sequence data into smaller clusters. All the available service resources of 
the servers contribute to GiSA to accelerate the assembly procedure, which is 
the common characteristic of grid systems. Moreover, when a server finishes 
earlier than others, it immediately gets another assembly task from the Task 
Queue until the queue is empty. As a result, the computing capacity of each 
server is fully used. In summary, this grid system provides a new solution for 
large-scale genome sequence assembly, and it is a meaningful application of 
grid computing in the 
area of sequence assembly. 

Abstract. This paper describes an example of how the Analytical View (AV) in 
Microsoft Business Framework (MBF) works. AV consists of three components: 
the design-time Model Service, the Business Intelligence Entity (BIE) 
programming model, and the runtime Intelli-Drill for navigation between OLTP 
and OLAP data sources. Model Service transforms an "object model (transactional 
view)" into a "multi-dimensional model (analytical view)." It infers 
dimensionality from the object layer where richer metadata is stored, eliminating 
the guesswork that a traditional data warehousing process requires. Model 
Service also generates BI Entity classes that enable a consistent object-oriented 
programming model with strong types and rich semantics for OLAP data. 
Intelli-Drill links together all the information in MBF using metadata, making 
information navigation in MBF fully discoverable. 



1 Introduction 

The goals of the analytical view [1] are to ensure less contention on the transactional 
databases, easier access to information, and tighter integration with the application 
framework’s programming model, such as Microsoft Business Framework (MBF), 
with a focus on prescriptiveness [2]. Furthermore, we want to unleash the information 
and data stored in the application through a set of framework-level programming 
models so they can be fully leveraged for BI, data mining, and information navigation 
in business applications. 

In MBF, Entity-Relational Maps (ER-Maps) describe how each field in a business 
entity (e.g., a “customer name” in the customer entity) originates from a column in 
a database table (e.g., the CustomerName column in the Customers table). The 
Model Service infers the respective OLAP cubes from the MBF object models, i.e. 
business entities in the form of metadata. After this model transformation, a set 
of classes, namely the Business Intelligence (BI) Entities, is also code-generated 
to objectify access to the multi-dimensional data in the OLAP cubes. 

AV automatically infers the corresponding analytical model from the transaction 
business logic. This process not only enables BI Entities to be generated automatically 
but also preserves the “transformation” logic to offer the full fidelity of the metadata 
describing relationships between business entities and BI Entities. The end result of 
this process is a technical breakthrough that enables BI Entities to drill back to 
business entities and navigate among them at design time and run time, using 
metadata. The Intelli-Drill run-time service extends the idea used by hypermedia [3] 
to object traversal in an object graph. Figure 1 illustrates the architecture vision 
for AV. 

Fig. 1. Our Architecture Vision for Analytical View 




Fig. 2. UML Model for the Example 



2 An Example 

A developer uses MBF to design an application object model using UML (see 
Figure 2). He maps the entities and relationships to relational objects and then runs 
Model Service to infer the dimensional model from the defined object model and the 
O-R mapping. The translator uses a rules engine to create dimensions and hierarchies. 
The translator first examines the model and determines the “reachable” objects from 
all defined measures. An object is reachable if a path exists to it from the measures 
through relationships of the correct cardinality. This ensures that the dimensions 
that are built can “slice” the measures. 

Fig. 3. Inferred Star Schema for the Example 



The translation engine then generates a Data Source View [4], which describes the 
entities as data sources in the OLAP model. Object relationships such as associations 
and compositions are emulated by foreign keys that the OLAP engine understands. 
Additional objects, known as the FACT objects, are built by traversing the object tree 
rooted at each focal point, following foreign keys in the many-to-one direction. 
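
The reachability test and the many-to-one traversal can be pictured as a graph 
search. The sketch below is a simplified illustration, not the MBF rules engine: 
it assumes that "correct cardinality" means many-to-one, and the entity names 
and the function `reachable_objects` are hypothetical. 

```python
from collections import deque


def reachable_objects(measure_objects, relationships):
    """Return the objects reachable from the measure objects via many-to-one links.

    relationships is a list of (source, target, cardinality) triples.
    """
    graph = {}
    for source, target, cardinality in relationships:
        if cardinality == "many-to-one":  # only these can "slice" a measure
            graph.setdefault(source, []).append(target)

    seen = set(measure_objects)
    queue = deque(measure_objects)
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - set(measure_objects)


# Hypothetical object model: OrderLine carries the measures.
relationships = [
    ("OrderLine", "Order", "many-to-one"),
    ("Order", "Customer", "many-to-one"),
    ("OrderLine", "Product", "many-to-one"),
    ("Customer", "Contact", "one-to-many"),  # wrong direction, not reachable
]
dimension_candidates = reachable_objects(["OrderLine"], relationships)
```

Here `Contact` is excluded because it is linked only through a one-to-many 
relationship, so it could not be used to slice the measures on `OrderLine`. 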

Finally, the translation engine builds a dimensional model from the Data Source 
View. A Sales cube is built with two measure groups [4] and dimensions (Figure 3). 
The measure groups are derived from objects with decorated measures. The rules 
engine determines the structure of the dimensions, rolling up some entities into a 
single dimension and constructing hierarchies with the appropriate levels. 

The deployment engine of the Model Service deploys the dimensional model on a 
specified UDM server and generates the BI Entity code for programmatic access. 

We also introduced the notion of a “Smart Report” to make information more 
accessible to end users wherever they are in a business application by leveraging the 
metadata and the Intelli-Drill runtime services. Figure 4 shows a mockup that 
illustrates the idea. In a Smart Report, data points are traversable through 
Intelli-Drill. For example, when a user types information about a customer in a 
sales order, the user can see the credit rating and payment history of this customer. 

Fig. 4. Sample Smart Report 



3 Conclusions and Future Work 

Traditionally, converting an object model into a dimensional model is done manually 
by reconstructing the business logic, which can be lost in the process. Importing data 
from the object model into the dimensional model also adds considerable overhead to 
the process of data analysis. 

The conversion from an object-oriented model to a dimensional model is a new 
concept in OLAP. Often, the two models are not related to each other because the 
people who deal with them have different backgrounds. Our breakthrough automates the 
conversion process and removes the need to reconstruct the business logic. As such, 
we provide the lowest-cost entry point for application developers to include BI or 
data mining functionality in their applications. Most of the work described here has 
been done and will be part of MBF. We are working to support prescriptive 
navigation using Intelli-Drill. 

1 Introduction 

One of the main challenges in building enterprise applications has been to strike a 
balance between general functionality and domain/scenario-specific customization. The 
lack of formal ways to extract, distill, and standardize the embedded domain knowledge 
has been a barrier to minimizing the cost of customization. Using ontology, as many 
would hope, will give application builders the much-needed methodology and 
standard for building flexible enterprise solutions [1, 2]. 

However, even with a rich body of research and quite a few excellent results on 
designing and building ontologies [3, 4], there are still gaps to be filled before the 
technology and concepts can be deployed in a real-life commercial environment. The 
problems are especially hard in applications that require well-defined semantics 
in mission-critical operations. In this presentation, we introduce two of our projects 
where ontological approaches are used for enterprise applications. Based on these 
experiences, we discuss the challenges in applying ontology-based technologies to 
business applications. 



2 Product Ontology 

In our current project, an ontology system is being built for the Public Procurement 
Services (PPS) of Korea, which is responsible for procurement for government and 
public agencies of the country. The main focus is the development of a system of 
ontologies representing products and services. This will include the definitions, 
properties, and relationships of the concepts that are fundamental to products and 
services. The system will supply tools and operations for managing catalog standards, 
and will serve as a standard reference system for e-catalogs. Strong support for the 
semantics of product data and processes will allow for dynamic, real-time data 
integration, and also real-time tracking and configuration of products, despite 
differing standards and conventions at each stage. 



* This work has been conducted in part under the Joint Study Agreement between IBM T. J. 
Watson Research Center, USA, and the Center for e-Business Technology, Seoul National 

An important component of the ontology model will be the semantic model for 
product classification schemes such as UNSPSC¹, since that alone can be used to 
enrich the current classification standards to include machine-readable semantic 
descriptions of products. The model will provide a logical structure in which to express 
the standards. The added semantics will enhance the accuracy of mappings between 
classification standards. 



3 Performance Monitoring Ontology 

We extend our discussions to another domain to argue that the issues and principles in 
the above project are not specific to product information, but rather can be generally 
applied to other database applications that deal with diverse semantics. 

In this second project, we attempt to create a worldwide monitoring and 
collaboration platform for petroleum surveillance engineers of a major oil production 
and distribution company. The job of a surveillance engineer is to constantly monitor 
multiple time-series sensor data streams, which measure production equipment 
outputs as well as natural environmental factors. An ontology that describes operational 
data and events associated with oil production is expected to serve as the reference 
for all operational sensor data and all equipment failure monitors. A performance 
monitoring ontology primarily serves three objectives in our system. First, the 
ontology organizes the matrix of sensor data in a semantically meaningful way for 
engineers to navigate and browse. Second, through the ontology, the pattern recognizers 
can be decoupled from the actual sensor data sources, which may be added, upgraded, 
and retired during the lifetime of a well. Third, the ontology helps to link treatment 
actions to pending failure events. The ontology is used throughout the working loop 
of sense, alert, decision, and reaction. 



4 Discussions and Conclusion 

Based on our experiences from the projects described above, we discuss some of the 
practical issues that hinder the widespread use of ontology-based applications in 
enterprise settings. We found a lack of modeling methodology, domain user 
tools, persistent storage, lifecycle management and access control for the creation, 
use, and maintenance of ontologies on a large, deployable scale. While our engagement 
is specific to government procurement and oil production, we believe that these 
findings carry over to similar business applications in other industries. 

Modeling: The level-of-abstraction problem haunts all aspects of ontology design. 
Multiple views and taxonomies, often with conflicting semantics, present another 
challenge for the field engineer. 

Ontology - DB Integration: The ontology can be modeled as metadata for the 
database, where the database alone represents the information content of the system 
and the ontology is a secondary facility. Alternatively, the ontology can be modeled 
as an integral part of the database, in which case the ontology must be part of all 
queries and operations. The trade-offs include implementation complexity, semantic 
richness, and efficiency. 

Ontology Lifecycle Management: Populating the ontology is a daunting task which 
can make or break the project. The job is complicated by multiple formats, semantic 
mismatches, and errors or dirty data in pre-existing information sources. Change 
management (versions, mergers, decompositions, etc.) is another complicated issue. 

Accountability and Control: One of the biggest concerns inhibiting ontology adop- 
tion in enterprise applications is its lack of control. When is an ontology complete, in 
the sense that it holds sufficient content to support all mission critical operations? Is 
the behavior/performance predictable? 

Human Factors: Building and maintaining the ontology requires much more than 
software engineers. Domain experts must define the concepts and relationships of the 
domain model. An ontological information model is not a concept easily understood by 
people who are not computer or ontology experts. A set of intuitive guidelines must be 
provided. Easy-to-use tools are also essential. 

Through this presentation, we wish to share our experiences with these issues and 
our solutions to some of them. The solutions to these problems are most likely to come 
as disciplines, guidelines, and tools that implement these guidelines. In our future 
research, we plan to build a map that links individual ontological requirements to 
ontology issues, and then to applicable ontology technology. 

Building very big taxonomies is a laborious task vulnerable to errors and 
management/scalability deficiencies. FASTAXON is a system for building very big 
taxonomies in a quick, flexible and scalable manner that is based on the faceted 
classification paradigm [4] and the Compound Term Composition Algebra [5]. 
Below we sketch the architecture and the functioning of this system and 
report our experiences from using it in real applications. 

Taxonomies, i.e. hierarchies of names, are probably the oldest and most widely 
used conceptual modeling tool, still used in Web directories, libraries and the 
Semantic Web (e.g. see XFML [1]). Moreover, the advantages of the taxonomy-based 
conceptual modeling approach for building large-scale mediators and P2P 
systems that support semantic-based retrieval services have been analyzed and 
reported in [7,6,8]. However, building very big taxonomies is a laborious task 
vulnerable to errors and management/scalability deficiencies. One method for 
efficiently building a very big taxonomy is to first define a faceted taxonomy (i.e. 
a set of independently defined taxonomies called facets) like the one presented 
in Figure 1, and then automatically derive the inferred compound taxonomy, i.e. 
the taxonomy of all possible compound terms (conjunctions of terms) over the 
faceted taxonomy. Faceted taxonomies carry a number of well-known advantages 
over single hierarchies in terms of building and maintaining them, as well as 
using them in multicriteria indexing (e.g. see [3]). FASTAXON is a system for 
building big (compound) taxonomies based on this idea. Using 
the system, the designer first defines a number of facets and assigns one 
taxonomy to each of them. After that, the system can dynamically generate 
(on the fly) a navigation tree that allows the designer (as well as the object 
indexer or end user) to browse the set of all possible compound terms. 

A drawback of faceted taxonomies, however, is the cost of avoiding the 
invalid (meaningless) compound terms, i.e. those that do not apply to any object 
in the domain. Consider the faceted taxonomy of Figure 1. Clearly we cannot 
do any winter sport in the Greek islands (Crete and Cefalonia), as they never 
have enough snow, and we cannot do any sea sport in Olympus, because Olympus 
is a mountain. For the sake of this example, let us also suppose that only 
Cefalonia has a Casino. Under this assumption, the partition of the set 
of compound terms into the sets of valid (meaningful) and invalid (meaningless) 
terms is shown in Table 1. The availability of such a partition would be very useful 
during the construction of a materialized faceted taxonomy (i.e. a catalog based 
on a faceted taxonomy). It could be exploited in the indexing process to 
prevent indexing errors, i.e. to allow only meaningful compound terms to be 
assigned to objects. It could also aid the indexer during the indexing process by 
dynamically generating a single hierarchical navigation tree that allows selecting 
the desired compound term by browsing only the meaningful compound terms. 
However, even from this toy example, it is obvious that defining 
such a partition by hand would be a formidably laborious task for the designer. 

Fig. 1. A faceted taxonomy for indexing hotel Web pages 

FASTAXON allows the designer to specify the meaningful compound terms in a very 
flexible manner. It is the first system that implements the recently introduced 
Compound Term Composition Algebra (CTCA) [5]. The algebra allows the designer to 
use an algebraic expression to specify the valid compound terms. This involves 
declaring only a small set of valid or invalid compound terms, from which other 
(valid or invalid) compound terms are then inferred. For instance, the partition 
shown in Table 1 can be defined using the expression 

$e = (Location \ominus_N Sports) \oplus_P Facilities$ 

with the following P and N parameters: 

N = {{Crete, WinterSports}, {Cefalonia, WinterSports}}, 
P = {{Cefalonia, SeaSki, Casino}, {Cefalonia, Windsurfing, Casino}}. 

Specifically, FASTAXON provides an Expression Builder for formulating CTCA 
expressions in a flexible, interactive and guided way. Only the expression that 
defines the desired compound terminology is stored (and not the inferred 
partition), as an inference mechanism is used to check (in polynomial time) whether 
a compound term belongs to the compound terminology of the expression. 
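
How a few declared terms can decide the validity of many others is sketched 
below for the inner part of the expression only, Location ⊖_N Sports, using N 
exactly as printed above. The sketch rests on two assumptions that are not 
spelled out in this text: a compound term is taken to be narrower than another 
when every term of the latter subsumes some term of the former, and the 
minus-product is taken to exclude every compound term that is narrower than or 
equal to an element of N. The facet hierarchies are also assumed, since Figure 1 
is not reproduced here, and all function names are hypothetical. 

```python
# Assumed child -> parent links for two facets (Figure 1 is not available here).
LOCATION = {
    "Greece": "Earth", "Finland": "Earth",
    "Olympus": "Greece", "Crete": "Greece", "Cefalonia": "Greece",
    "Heraklio": "Crete",
}
SPORTS = {
    "SeaSports": "AllSports", "WinterSports": "AllSports",
    "SeaSki": "SeaSports", "Windsurfing": "SeaSports",
    "SnowSki": "WinterSports",
}
PARENT = {**LOCATION, **SPORTS}

# Declared invalid compound terms, as given in the text.
N = [{"Crete", "WinterSports"}, {"Cefalonia", "WinterSports"}]


def term_narrower_or_equal(term, other):
    """True if term equals other or lies below it in its facet hierarchy."""
    while term is not None:
        if term == other:
            return True
        term = PARENT.get(term)
    return False


def compound_narrower_or_equal(s, s_prime):
    """True if every term of s_prime subsumes at least one term of s."""
    return all(
        any(term_narrower_or_equal(t, t_prime) for t in s) for t_prime in s_prime
    )


def is_valid(compound):
    """Valid unless the compound is narrower than (or equal to) an element of N."""
    return not any(compound_narrower_or_equal(compound, n) for n in N)
```

With only the two declared elements of N, a term such as {Heraklio, SnowSki} is 
inferred to be invalid, while {Greece, WinterSports} and {Cefalonia, SeaSki} 
remain valid. Each check scans N and walks up the hierarchies, so its cost stays 
polynomial in the size of the taxonomy and of N. 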

The productivity obtained using FASTAXON is quite impressive. The experimental 
evaluation so far has shown that in many cases a designer can define from 
scratch a compound taxonomy of around 1000 indexing terms in a few minutes. 
FASTAXON has been implemented as a client/server Web-based system written 
in Java. The server is based on the Apache Web server and the Tomcat application 
server, and uses MySQL for persistent storage. The user interface is based on 
DHTML (dynamic HTML), JSP (Java Server Pages) and Java Servlet technologies 
(J2EE). The client only needs a Web browser that supports JavaScript (e.g. 
Microsoft Internet Explorer 6). Future extensions include modules for importing 
and exporting XFML [1] and XFML+CAMEL [2] files. FASTAXON will 
be published under the VTT Open Source Licence within 2004 (for more see 
http://fastaxon.erve.vtt.fi/). 



FASTAXON: A System for FAST (and Faceted) TAXONomy Design 843 



Table 1. The Valid and Invalid compound terms of the example of Figure 1 



Valid 

Earth, AllSports 
Finland, AllSports 
Crete, AllSports 
Reth., AllSports 
Earth, SeaSports 
Finland, SeaSports 
Cefal., SeaSports 
Heraklio, SeaSports 
Greece, WinterSp. 
Olympus, WinterSp. 
Greece, SeaSki 
Crete, SeaSki 
Reth., SeaSki 
Earth, WindSurf. 
Finland, WindSurf. 
Cefal., WindSurf. 
Heraklio, WindSurf. 
Greece, SnowB. 
Olympus, SnowB. 
Greece, SnowSki 
Olympus, SnowSki 
Greece, AllSports, Cas. 
SeaSports, Cas. 
Cefal., SeaSports, Cas. 
Greece, WinterSp., Cas. 
Greece, SeaSki, Cas. 
Earth, WindSurf., Cas. 
Cefal., WindSurf., Cas. 
Greece, SnowB., Cas. 
Greece, SnowSki, Cas. 

Greece, AllSports 
Olympus, AllSports 
Cefal., AllSports 
Heraklio, AllSports 
Greece, SeaSports 
Crete, SeaSports 
Reth., SeaSports 
Earth, WinterSp. 
Finland, WinterSp. 
Earth, SeaSki 
Finland, SeaSki 
Cefal., SeaSki 
Heraklio, SeaSki 
Greece, WindSurf. 
Crete, WindSurf. 
Reth., WindSurf. 
Earth, SnowB. 
Finland, SnowB. 
Earth, SnowSki 
Finland, SnowSki 
Earth, AllSports, Cas. 
Cefal., AllSports, Cas. 
SeaSports, Cas. 
Earth, WinterSp., Cas. 
Earth, SeaSki, Cas. 
Cefal., SeaSki, Cas. 
Greece, WindSurf., Cas. 
Earth, SnowB., Cas. 
Earth, SnowSki, Cas. 

Invalid 

Olympus, SeaSports 
Crete, WinterSp. 
Reth., WinterSp. 
Olympus, SeaSki 
Crete, SnowB. 
Reth., SnowB. 
Crete, SnowSki 
Reth., SnowSki 
Olympus, SeaSports, Cas. 
Cefal., WinterSp., Cas. 
Heraklio, WinterSp., Cas. 
Olympus, WindSurf., Cas. 
Cefal., SnowB., Cas. 
Heraklio, SnowB., Cas. 
Cefal., SnowSki, Cas. 
Heraklio, SnowSki, Cas. 
Crete, AllSports, Cas. 
Heraklio, AllSports, Cas. 
Reth., SeaSports, Cas. 
Olympus, WinterSp., Cas. 
Reth., SeaSki, Cas. 
Crete, WindSurf., Cas. 
Heraklio, WindSurf., Cas. 
Olympus, SnowSki, Cas. 
Finland, SeaSports, Cas. 
Finland, SeaSki, Cas. 
Finland, SnowSki, Cas. 

Cefal., WinterSp. 
Heraklio, WinterSp. 
Olympus, WindSurf. 
Cefal., SnowB. 
Heraklio, SnowB. 
Cefal., SnowSki 
Heraklio, SnowSki 
Crete, WinterSp., Cas. 
Reth., WinterSp., Cas. 
Olympus, SeaSki, Cas. 
Crete, SnowB., Cas. 
Reth., SnowB., Cas. 
Crete, SnowSki, Cas. 
Reth., SnowSki, Cas. 
Olympus, AllSports, Cas. 
Reth., AllSports, Cas. 
Crete, SeaSports, Cas. 
Heraklio, SeaSports, Cas. 
Crete, SeaSki, Cas. 
Heraklio, SeaSki, Cas. 
Reth., WindSurf., Cas. 
Olympus, SnowB., Cas. 
Finland, AllSports, Cas. 
Finland, WinterSp., Cas. 
Finland, WindSurf., Cas. 
Finland, SnowB., Cas. 

1 Introduction 

The management and exchange of knowledge on the Internet has become the 
cornerstone of technological and commercial progress. In this fast-paced 
environment, the competitive advantage belongs to those businesses and individuals 
that can leverage the unprecedented richness of web information to define busi- 
ness partnerships, to reach potential customers and to accommodate the needs 
of these customers promptly and flexibly. The Semantic Web vision is to provide 
a standard information infrastructure that will enable intelligent applications 
to automatically or semi-automatically carry out the publication, the searching, 
and the integration of information on the Web. This is to be accomplished by 
semantically annotating data and by using standard inferencing mechanisms on 
this data. This annotation would allow applications to understand, say, dates 
and time intervals regardless of their syntactic representation. For example, in 
the e-business context, an online catalog application could include the expected 
delivery date of a product based on the schedules of the supplier, the shipping 
times of the delivery company and the address of the customer. The infrastruc- 
ture envisioned by the Semantic Web would guarantee that this can be done 
automatically by integrating the information of the online catalog, the supplier 
and the delivery company. No changes to the online catalog application would be 
necessary when suppliers and delivery companies change. No syntactic mapping 
of metadata will be necessary between the three data repositories. 

To accomplish this, two things are necessary: (1) the data structures must 
be rich enough to represent the complex semantics of products and services and 
the various ways in which these can be organized; and (2) there must be flexible 
customization mechanisms that enable multiple customers to view and integrate 
these products and services with their own categories. Ontologies are the answer 
to the former; ontology views are the key to the latter. 

We propose ontology views as a necessary mechanism to support the ubiquitous 
and collaborative use of ontologies. Different agents (human or 
computational) require different organizations of data and different vocabularies 
to suit their information-seeking needs, but the lack of flexible tools to customize 
and evolve ontologies makes it impossible to find and use the right nuggets of 
information in such environments. When using an ontology, an agent should be 
able to introduce new classes using high-level constraints and to define contexts 
that enable efficient, effective and secure information searching. In this paper we 
present a framework that enables users to design customized ontology views, 
and we show that views are the right mechanism to enhance the usability of 
ontologies. 

2 Ontology Views 

Database views and XML views [1-3, 5-7] have been used extensively both to 
tailor data to specific applications and to limit access to sensitive data. Much like 
traditional views, ontology views must provide a flexible model 
that meets the demands of different applications as well as different categories 
of users. For example, consider an online furniture retailer, OLIE, that wants 
to take advantage of ontology-based technologies and provide a flexible and 
extensible information model for its web-based applications. The retailer creates 
an ontology that describes the furniture inventory, manufacturers and customer 
transactions. Let us assume that two primary applications use this ontology. 
The first application, a catalog browsing application, allows customers to browse 
the furniture catalog and make online purchases, while the second application, 
a pricing application, allows marketing strategists to define sales promotions 
and pricing. The information needs of these two applications are very different. 
For example, customers should not be allowed to access the wholesale price of 
a furniture piece. Similarly, an analyst is only concerned with attributes of a 
furniture piece that describe it as a marketable entity, not those that refer to its 
dimensions, which are primarily of interest to customers. 

The catalog browsing and the pricing applications need to take these 
restrictions into consideration when querying and displaying the ontology to their 
respective users. If the ontology changes, regardless of how powerful the 
inferencing is, the applications will invariably need to change their queries. This 
hard-coded approach to accessing ontologies is costly in development time and 
error-prone, and underscores the need for a flexible model for ontology views. In this 
case, it is desirable to be able to define the MarketingView and CustomerView 
as in the ontology fragment shown in Figure 1. 

Despite their similarities with relational database views, ontology views also 
have distinguishing characteristics. First, ontology views need to be first-class 
citizens in the model, with relations and properties just like regular ontology classes. 
For example, suppose that the pricing analyst wants to define the PreferredCus- 
tomer category, as a customer with a membership card that offers special prices 
for furniture and accessories. Now the catalog application needs a Preferred- 
CustomerView, similar to the CustomerView defined in Figure 1, adding the 
promotional price for card holders. It would also be desirable to define the 
PreferredCustomerView as a subclass of CustomerView, so that whenever some 
information is added to or removed from the CustomerView, the changes are 
automatically reflected in the PreferredCustomerView. Notice that, in this case, we 
have an inheritance hierarchy within the views, that is, PreferredCustomerView 
IsA CustomerView, as shown in Figure 2. 

Fig. 2. Inheritance Hierarchy in Ontology Views. 

Second, views need to be used as contexts to interpret further queries. For 
example, suppose that the marketing analyst defines the class SeasonalItems as 
a set of furniture pieces or accessories that have an unusually high volume of sales 
in a given shopping season, based on previous years' sales statistics. The analyst 
also defines ChristmasItems, SummerItems and FallItems as refinements of 
SeasonalItems. When a customer queries for information on large oval tablecloths 
at Christmas, items in the ChristmasItems view should be selected, and the 
information on each item should be filtered through either the CustomerView or 
the PreferredCustomerView, depending on the type of customer. 

It is easy to see that views need to represent structures of the ontology (like 
CustomerView) as well as new classes defined through constraints, much like 
OWL [4] class operators. In fact, the views proposed here are extensions to 
OWL classes and expressions, as discussed in Section 2.1. 



2.1 CLOVE — A View Definition Language for OWL 

We focus on the systematic description and management of views as first-class 
objects in ontologies, as described in the scenarios above. To the best of our 
knowledge, this work is the first of its kind in defining ontology views as first- 
class objects. In particular, we extend OWL [4], a recently proposed standard 
of W3C, to describe ontologies and their views. OWL allows the definition of 
classes from other classes through set operations, thereby providing the basic 
infrastructure support for defining simple views, like the SeasonalItems category 
described above. However, it has limitations. First, even though ontology views 
can be considered as classes that are derived from the underlying ontology, they 
can also refer to subnetworks or structures of classes and relations (like in the 
case of the CustomerView), underscoring the need for a language rich enough 
to define both types of views. Second, we need to define a set of standard rules 
that govern the creation and management of these views, as well as their scope 
and visibility. While the latter is still an open problem, there are some simple 
mechanisms that allow adequate view definitions. In this paper, we present an 
overview of a high-level constraint language, CLOVE (Constraint Language 
for Ontology View Environments), that extends OWL constraints. We employ 
CLOVE as the underlying mechanism to support the creation of OWL views. 

A view in CLOVE is defined by a set of (1) subject clauses; (2) object clauses; 
and (3) variable definitions. The subject clauses describe the constraints under 
which the view is valid, as well as the range of instances for which the view is 
applicable. The subject clauses are used to check whether the view (if declared 
active) should be used in the current query. CLOVE does not restrict the number 
of subjects of a view. For example, the CustomerView defines as subjects all 
types of customers. It is also possible to leave the subject of a view unspecified by 
using the keyword ANY, in which case the CLOVE runtime system uses the 
view to filter all queries when the view is active. A subject is defined through a 
NavigationExpression, which is described below. The objects are expressions that 
describe the content of a view, and have the form: 

{INCLUDE | EXCLUDE} NavigationExpression ConstraintExpression 

where the keywords INCLUDE and EXCLUDE indicate whether the classes 
or instances satisfying the clause are included or excluded from the view. The 
NavigationExpression is a Boolean expression of relations or properties that are 
navigated from the set of currently evaluated classes and instances or from a 
variable or name included in the expression. For example, ?Object SUBSUMES 
and IS-A are valid navigation expressions. 

The ConstraintExpression is an extension of an OWL expression. In its 
simplest form it is just the name of a class or instance, but it can also describe the 
content of its data (the WITH CONTENT in Figure 3) or the data type of the 
properties of a class or instance (with WITH TYPE), among others. In the example 
in Figure 3, Customer and MarketableEntity are valid and very simple constraint 
expressions. 

CLOVE also defines variables that can be used directly in clauses, and it 
allows users to define their own variables. A variable in CLOVE is preceded by 
a question mark. In Figure 3, the variable ?object refers to the currently 
evaluated content of the view. There is also a pre-defined variable, ?subject, that refers 
to all the currently evaluated subjects of the view. User-defined variables can 
be used to define scripts or procedures to calculate data from the existing data, 
like LastYearXmSales in Figure 3, which is evaluated from existing properties of 
LastYearSales, the November and December sales. 
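
The arithmetic behind the LastYearXmSales variable and the ChristmasItems clause of Figure 3 can be sketched in Python. This is only an illustration under stated assumptions: the function names and the sample figures are hypothetical, and the average monthly sales are assumed to be computed per item from its own twelve monthly values.

```python
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]


def last_year_xmas_sales(monthly_sales: dict) -> float:
    """Derived value: November plus December sales (as in the Define clause)."""
    return monthly_sales["November"] + monthly_sales["December"]


def is_christmas_item(monthly_sales: dict, factor: float = 1.2) -> bool:
    """Include an item when its Christmas sales exceed factor * average monthly sales."""
    average = sum(monthly_sales[m] for m in MONTHS) / len(MONTHS)
    return last_year_xmas_sales(monthly_sales) > factor * average


# Hypothetical sample data: a tablecloth sells mostly at the end of the year.
tablecloth = {m: 10 for m in MONTHS}
tablecloth.update({"November": 40, "December": 60})

# A garden chair sells steadily, except in winter.
garden_chair = {m: 12 for m in MONTHS}
garden_chair.update({"November": 0, "December": 1})

tablecloth_selected = is_christmas_item(tablecloth)
garden_chair_selected = is_christmas_item(garden_chair)
```

With these sample figures only the tablecloth satisfies the clause.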

The full specification of CLOVE is beyond the scope of this paper but Fig- 
ure 3 gives a brief example of the creation of some of the OWL views of the 
scenarios above using CLOVE. 

CustomerView { 
    Subject: IS-A Customer 
    Object: INCLUDE ATTRIBUTE-OF FurniturePiece 
    Object: INCLUDE ATTRIBUTE-OF Accessory 
    Object: EXCLUDE ?Object SUBSUMES MarketableEntity 
    Object: INCLUDE ?Object HAS-PRICE Retail 
    Object: INCLUDE ?Object HAS-PRICE Sale 
} 

PreferredCustomerView IS-A CustomerView { 
    Subject: ANY 
} 

SeasonalItems { 
    Subject: FurnitureItem 
} 

ChristmasItems IS-A SeasonalItems { 
    Subject: ANY 
    Define: LastYearXmSales IS-A LastYearSales WITH CONTENT 
        LastYearSales.November + LastYearSales.December 
    Object: INCLUDE ?object HAS-STAT ?LastYearXmSales > 
        1.2 * (MarketableEntity HAS-STAT LastYearSales.AvgMonthlySales) 
} 

Fig. 3. Creating views with CLOVE. 

CLOVE allows arbitrary relations among views, in particular, inheritance 
(that is, IS-A). CLOVE also allows the dynamic creation of classes to evaluate 
views (like LastYearXmSales as a refinement of LastYearSales in Figure 3). 
After defining them, views can be activated or deactivated by their authors or 
by users with administrative privileges. The runtime system requires that every 
query is tagged with information about the user, which is associated with a class in 
the ontology. Queries are evaluated with respect to the currently active views in 
the order that they were defined. The result is that queries against the ontology 
are automatically filtered by one or more views, according to the current user 
context. 
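
The filtering behaviour described above can be illustrated with a small Python sketch. It is not CLOVE itself: the View class, its methods, the run_query function and the attribute names (such as member_price and wholesale_price) are hypothetical, clauses are reduced to plain sets of attribute names, and a view is assumed to apply only when its subject equals the class tagged on the query.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class View:
    """Simplified stand-in for a CLOVE view (hypothetical model)."""
    name: str
    subject: Optional[str] = None              # user class; None stands for ANY
    include: set = field(default_factory=set)  # attribute names added by the view
    exclude: set = field(default_factory=set)  # attribute names removed by the view
    parent: Optional["View"] = None            # IS-A relation between views
    active: bool = True

    def visible(self) -> set:
        """Attributes exposed by this view, including those inherited via IS-A."""
        inherited = self.parent.visible() if self.parent else set()
        return (inherited | self.include) - self.exclude

    def applies_to(self, user_class: str) -> bool:
        """An active view is used when its subject is ANY or matches the user class."""
        return self.active and self.subject in (None, user_class)


def run_query(instance: dict, user_class: str, views: list) -> dict:
    """Filter one instance through every applicable view, in definition order."""
    result = dict(instance)
    for view in views:
        if view.applies_to(user_class):
            allowed = view.visible()
            result = {k: v for k, v in result.items() if k in allowed}
    return result


customer_view = View("CustomerView", subject="Customer",
                     include={"name", "dimensions", "retail_price", "sale_price"})
preferred_view = View("PreferredCustomerView", subject="PreferredCustomer",
                      include={"member_price"}, parent=customer_view)
views = [customer_view, preferred_view]

sofa = {"name": "Sofa", "dimensions": "200x90", "retail_price": 900,
        "sale_price": 800, "member_price": 750, "wholesale_price": 500}

customer_result = run_query(sofa, "Customer", views)
preferred_result = run_query(sofa, "PreferredCustomer", views)
```

Because PreferredCustomerView inherits from CustomerView in this sketch, adding an attribute to the parent's include set is automatically reflected in the child, which mirrors the motivation given for view inheritance.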

All users with access to the ontology should be able to create views. This 
is one of the most important design principles of CLOVE. However, not every 
view should be used to filter every query, which is why the CLOVE runtime system 
keeps track of view dependencies and of who created them, with a simple access 
control system based on user IDs. 



3 Conclusions 

The Semantic Web brings forth the possibility of heterogeneous ontologies that 
are universally accessible to arbitrary agents through the Internet. These agents 
may not only access these ontologies, but also customize their organization and 
information with their own knowledge and communicate it in turn to their own 
users. Hence, the ability to create views and contexts on ontologies becomes as 
crucial as the view mechanism in traditional database technologies, providing a 
scope and filtering of information necessary to modularize and evolve ontologies. 
However, ontology views are not just straightforward extensions of database 
views. We have designed and implemented a framework that explores the issues 
of authoring and management of views and their underlying ontologies. Among 
them, we have focused on the dual nature of views as classes in the ontology and 
contexts to interpret new queries. As contextual elements, views are structures 
of classes and as classes they have relations to other views and even to other 
classes. We have also implemented a constraint language, CLOVE, that takes into 
account this duality and allows users to both create and query views with an 
easy-to-use, natural interface. 

Abstract. In this work we present iRM, an OMG MOF-compliant repository 
system that acts as a custom-defined application or system catalogue. iRM 
enforces structural integrity using a novel approach. iRM provides declarative 
querying support. iRM finds use in evolving data intensive applications, and in 
fields where integration of heterogeneous models is needed. 



1 Introduction 

Repository systems are “shared databases about engineered artifacts” [2]. They facili- 
tate integration among various tools and applications, and are therefore central to an 
enterprise. Loosely speaking, repository systems are data stores with a customizable 
system catalogue, a new and distinguishing feature. Repository systems exhibit an 
architecture comprising several layers of metadata [3], e.g. the repository application’s 
instance data (M0), the repository application’s model (M1), the meta-model (M2) and the 
meta-meta-model (M3). In comparison to database systems, repository systems contain an 
additional metadata layer, M3, allowing for a custom-definable and extensible system 
catalogue (M2). Preserving consistency between the different layers is a major 
challenge specific to repository systems. A declarative query language with higher-order 
capabilities is needed to provide model (schema) independent querying. Treating data 
and metadata in a uniform manner is a key principle when querying repository objects 
on different meta-layers. Application areas include domain-driven application 
engineering, scientific repositories and data-intensive Web applications. In this demonstration 
we present an OMG MOF-based repository system developed in the frame of the iRM 
project [1] (Fig. 1). 



2 iRM/RMS Repository System and mSQL Query Language 

Structural consistency is one of the key issues in repository systems and must be en- 
forced automatically by the RMS. It ensures that the structure of the repository ob- 
jects conforms to its definition on the upper meta-layer. Without structural integrity 
the repository data will be inconsistent (no type conformity), which has profound 
consequences on any repository applications, since they rely heavily on reflection. 
The concept of repository transactions is an integral part of the structural consistency 
of the repository data. Repository systems must be able to handle concurrent 
multi-client access, i.e. concurrent atomic sets of operations from multiple repository 
clients. Implementing isolation requires extension of the traditional locking mechanisms, 
e.g. multi-granularity locking mechanisms in OODB [4]. In iRM/RMS we introduce 
an “instance lattice” in addition to aggregation and class (type) lattices. 
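
To make the notion of structural integrity concrete, the following Python sketch checks whether an object conforms to its definition on the next meta-layer up. It is an illustration only, not the approach used in iRM/RMS: the dictionary layout and the violations function are hypothetical.

```python
def violations(obj: dict, definition: dict) -> list:
    """List the ways in which obj fails to conform to its definition."""
    problems = []
    declared = definition["attributes"]
    for name, expected_type in declared.items():
        if name not in obj["values"]:
            problems.append(f"missing attribute: {name}")
        elif not isinstance(obj["values"][name], expected_type):
            problems.append(f"wrong type for attribute: {name}")
    for name in obj["values"]:
        if name not in declared:
            problems.append(f"undeclared attribute: {name}")
    return problems


# A model element (upper layer) and one of its instances (lower layer).
person_class = {"name": "Person", "attributes": {"name": str, "age": int}}
ann = {"instance_of": "Person", "values": {"name": "Ann", "age": 41}}

before_change = violations(ann, person_class)

# Modifying the upper layer while instances exist breaks conformity
# unless the repository reacts (for example by migrating the instances).
person_class["attributes"]["email"] = str
after_change = violations(ann, person_class)
```

The second check shows why changes on an upper layer with existing lower-layer data are the interesting case: the instance was valid before the change and is no longer valid afterwards.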




Fig. 1. Logical Architecture of the iRM Project 



We introduce mSQL (meta SQL), a query language that extends SQL, to 
account for the specifics of repository systems. The mSQL syntax is inspired by 
SchemaSQL [5]. The main value of mSQL lies in its declarative nature - especially 
beneficial in the context of repository systems. Given only a programmatic access, 
through the RMS API, a repository application needs additional code to load reposi- 
tory objects. mSQL queries significantly simplify this task and reduce application 
complexity. mSQL allows model-independent querying: querying attribute values in 
classes on a meta-layer that are instances of a specified meta-class. 
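
The idea of model-independent querying can be sketched in Python as follows. This is not mSQL syntax and does not use the RMS API; the object layout, the layer labels and both function names are assumptions made for illustration. The query names only a meta-class, and the classes to be searched are discovered at run time.

```python
def instances_of(objects: list, type_id: str) -> list:
    """All repository objects whose direct type is type_id."""
    return [o for o in objects if o["instance_of"] == type_id]


def query_by_meta_class(objects: list, meta_class: str, attribute: str) -> list:
    """Return (class, instance, value) for every instance of every class
    that is itself an instance of the given meta-class."""
    rows = []
    for cls in instances_of(objects, meta_class):          # schema discovery
        for inst in instances_of(objects, cls["id"]):      # data access
            if attribute in inst["attrs"]:
                rows.append((cls["id"], inst["id"], inst["attrs"][attribute]))
    return rows


# Hypothetical repository content spanning three layers.
repository = [
    {"id": "Table", "layer": "M2", "instance_of": "Class", "attrs": {}},
    {"id": "View", "layer": "M2", "instance_of": "Class", "attrs": {}},
    {"id": "Customer", "layer": "M1", "instance_of": "Table", "attrs": {}},
    {"id": "Order", "layer": "M1", "instance_of": "Table", "attrs": {}},
    {"id": "Report", "layer": "M1", "instance_of": "View", "attrs": {}},
    {"id": "c1", "layer": "M0", "instance_of": "Customer", "attrs": {"name": "Ann"}},
    {"id": "o1", "layer": "M0", "instance_of": "Order", "attrs": {"name": "order-17"}},
    {"id": "r1", "layer": "M0", "instance_of": "Report", "attrs": {"name": "monthly"}},
]

table_names = query_by_meta_class(repository, "Table", "name")
```

The caller never mentions Customer or Order, so the same query keeps working when further classes of the same meta-class are added.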



3 The Demonstration 

The demonstration will show the enforcement of structural integrity in iRM. We will 
consider several cases: (a) creation of new models and import of data; (b) 
modification of existing meta-models (M2) with existing M1 models and instance data; 
(c) concurrent multi-client access. The second and the third cases illustrate the main 
value of structural integrity. We will showcase the execution of mSQL queries, illustrating 
the value of mSQL to repository applications. We shall demonstrate model-independent 
querying and dynamic schema discovery with mSQL. 

This paper presents a demonstration of visXcerpt [BBS03,BBSW03], a visual query 
language for both standard Web and Semantic Web applications. 



Principles of visXcerpt. The Semantic Web aims at enhancing data and service retrieval 
on the Web using meta-data and automated reasoning. Meta-data on the Semantic Web 
is heterogeneous. Several formalisms have been proposed, e.g. RDF, Topic Maps and 
OWL, and some of these formalisms already have a large number of syntactic variants. 
Like Web data, Web meta-data will be highly distributed. Thus, meta-data retrieval for 
Semantic Web applications will most likely call for query languages similar to those 
developed for the standard Web. This paper presents a demonstration of a visual query 
language for the Web and Semantic Web called visXcerpt. visXcerpt is based on three 
main principles. 

First, visXcerpt has been conceived for querying not only Web meta-data, but 
also all kinds of Web data. The reason is that many Semantic Web applications will 
most likely refer to both standard Web and Semantic Web data, i.e. to Web data and 
Web meta-data. Using a single query language well-suited for data of both kinds is 
preferable to using different languages, for it reduces the programming effort and hence 
costs, and it avoids mismatches resulting from interoperating languages. Second, 
visXcerpt is a query language capable of inference. The inferences visXcerpt can 
perform are limited to simple inference, as needed in querying database views, in logic 
programming, and in usual forms of Semantic Web reasoning. Offering both 
inference and querying in the same language avoids, e.g., the impedance mismatch that 
commonly arises when querying and inferencing are performed in different processes. 
Third, visXcerpt has been conceived as a mere Hypertext rendering of a textual 
query language. This approach to developing a visual language is entirely new. It has 
several advantages. It results in a visual language tightly connected to a textual language, 
namely the textual language it is a rendering of. This tight connection makes it possible 
to use both the visual and the textual language in the development of applications. 
Last but not least, a visual query language conceived as a Hypertext application is 
especially accessible for Web and Semantic Web application developers. 

Further principles of visXcerpt are as follows. visXcerpt is rule-based. visXcerpt is 
referentially transparent and answer-closed. Answers to visXcerpt queries can be ar- 
bitrary XML data. visXcerpt uses (like the celebrated visual database query language 
QBE) patterns for binding variables in query expressions instead of path expressions 
- as do e.g. the Web query languages XQuery and XSLT. visXcerpt keeps queries and 
constructions separated. 
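
The difference between binding variables through patterns and navigating with path expressions can be illustrated with a short Python sketch. It does not reproduce Xcerpt or visXcerpt syntax or semantics: terms are modelled as hypothetical (label, children) tuples, variables are strings starting with a question mark, and only total, ordered matching is supported.

```python
def match(pattern, data, bindings=None):
    """Match a pattern term against a data term.

    Returns a dict of variable bindings, or None if the terms do not match.
    """
    bindings = dict(bindings or {})
    if isinstance(pattern, str) and pattern.startswith("?"):
        # A variable binds to any data term, but must bind consistently.
        if pattern in bindings and bindings[pattern] != data:
            return None
        bindings[pattern] = data
        return bindings
    if isinstance(pattern, str) or isinstance(data, str):
        return bindings if pattern == data else None
    (p_label, p_children), (d_label, d_children) = pattern, data
    if p_label != d_label or len(p_children) != len(d_children):
        return None
    for p_child, d_child in zip(p_children, d_children):
        bindings = match(p_child, d_child, bindings)
        if bindings is None:
            return None
    return bindings


# The pattern has the same shape as the data it queries.
person = ("person", [("name", ["Ann"]), ("knows", ["Bob"])])
pattern = ("person", [("name", ["?who"]), ("knows", ["?friend"])])

found = match(pattern, person)
```

The pattern looks like a sample of the data with some parts replaced by variables, which is what allows queries and data to share one visualization.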

Language Visualization as Hypertext Rendering. XML and hence modelling languages 
for the Semantic Web based on XML like RDF, Topic Maps and OWL, are visualized 
in visXcerpt as nested, labeled boxes, each box representing an XML element. Graph 
structures are represented using Hyperlinks. Colors are used for conveying the nesting 
depth of XML elements. As visXcerpt’s query and construction patterns can be seen 
as samples, the same visualization can be used for query and construction patterns. 
This makes visXcerpt’s visualization of queries and answer constructions very close 
to the visualization of the data the queries and answer constructions refer to. 
visXcerpt has interactive features that help in quickly understanding large programs: 
boxes representing XML elements can be folded and unfolded, and semantically related 
portions of programs (e.g. different occurrences of the same variable) can be 
highlighted. visXcerpt programs can be composed using a novel Copy-and-Paste paradigm 
specifically designed for tree (or term) editing. Patterns are provided as templates to 
support easy construction of visXcerpt programs without in-depth prior knowledge of 
visXcerpt’s syntax. Today’s Web standards together with Web browsers offer an ideal 
basis for the implementation of a language such as visXcerpt. The visXcerpt prototype 
demonstrated is implemented using only well-established techniques like CSS, 
ECMAScript, and XSL and, of course, the run time system of the textual query language 
Xcerpt [SB04] (cf. http://xcerpt.org). 

Demonstrated Application. The application used for demonstrating visXcerpt is based 
on data inspired by “Friend of a Friend” (cf. http://xmlns.com/foaf/0.1/), 
expressed in various formats, including plain XML and RDF formats. The demonstration 
illustrates the following aspects of the visual query language visXcerpt. 

Standard Web and Semantic Web data can be retrieved using the same visual query 
language, visXcerpt. Meta-data formatted in various Semantic Web formats are 
conveniently retrieved using visXcerpt. visXcerpt queries and answer constructions are 
expressed using patterns that are intuitive and easy to express (cf. [BBS03,BBSW03] for 
examples). Hypertext features are used by visXcerpt such as Hypertext links for fol- 
lowing references forward and backward or different renderings (such as hiding and 
showing of program components or XML elements) so as to help screening large pro- 
grams. Recursive visXcerpt programs are presented and evaluated demonstrating that 
visXcerpt gives rise to a rather simple expression of transitive closures of Semantic Web 
relations and of recursive traversal of nested Web documents. 
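
As a point of comparison for the recursive programs mentioned above, the transitive closure of a relation such as a "knows" relation between persons can be computed as sketched below in Python. This is a generic fixpoint computation, not a visXcerpt program; the function name and the sample names are hypothetical.

```python
def transitive_closure(edges: set) -> set:
    """Repeatedly apply the rule  R(a, d) <- R(a, b), R(b, d)  until nothing changes."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


knows = {("Ann", "Bob"), ("Bob", "Cid"), ("Cid", "Dee")}
knows_transitively = transitive_closure(knows)
```

A rule-based language expresses the same computation as one base rule and one recursive rule, leaving the iteration to the evaluator.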

This research has been funded within the 6th Framework Programme project 
REWERSE number 506779 (cf. http://www.rewerse.net). 

1 Introduction 

In recent years, ranked retrieval systems for heterogeneous XML data with both struc- 
tural search conditions and keyword conditions have been developed for digital li- 
braries, federations of scientific data repositories, and hopefully portions of the ulti- 
mate Web. These systems, such as XXL [2], are based on pre-defined similarity mea- 
sures for atomic conditions (using index structures on contents, paths and ontological 
relationships) and then use rank aggregation techniques to produce ranked result lists. 
An ontology can play a positive role for term expansion [2], by improving the average 
precision and recall in the INEX 2003 benchmark [3]. 

Due to the users’ lack of information on the structure and terminology of the un- 
derlying diverse data sources, and the complexity of the (powerful) query language, 
users can often not avoid posing overly broad or overly narrow initial queries, thus get- 
ting either too many or too few results. For the user, it is more appropriate and easier 
to provide relevance judgments on the best results of an initial query execution, and 
then refine the query, either interactively or automatically by the system. This calls for 
applying relevance feedback technology in the new area of XML retrieval [1]. 

The key question is how to appropriately generate a refined query based on a user’s 
feedback in order to obtain more relevant results among the top-k result list. Our 
demonstration will show an approach for extracting user information needs by relevance 
feedback, maintaining more intelligent personal ontologies, clarifying uncertainties, 
reweighting atomic conditions, expanding the query, and automatically generating a refined 
query for the XML retrieval system XXL. 



2 Stages of the Retrieval Process 

a. Query Decomposition and Weight Initialization: A query is composed of weighted 
(i.e., differently important) atomic conditions, for example, XML element content 
constraints, XML element name (tag) constraints, path pattern constraints, ontology 
similarity constraints, variable constraints, search space constraints, and output constraints. In 
the XXL system, each atomic condition has an initial weight. If some constraints are 
uncertain, we mark them with a dedicated operator. Concrete examples are shown in the 
poster. 

b. Retrieval with Ontology-based Similarity Computation: Content index and path 
index structures are pre-computed and used for the relevance score evaluation of result 
item candidates. The global ontology index is built beforehand as a table of concepts 
from WordNet, and frequency-based correlations of concepts are computed statistically 
using large web crawls. To enable efficient query refinement in the following feedback 
iterations, we have a set of strategies to maintain a query-specific personal ontology 
which is automatically generated from fragments of the global ontology. This is the 
source for further query term expansion, as well as ontological similarity computations. 

c. Result Navigation and Feedback Capturing: The retrieved ranking list is visualized 
in a user-friendly way supporting zoom plus focus. Features like group selection and 
re-ranking are supported in our system, which can capture richer feedback at various 
levels, i.e., content, path and overall level. 

d. Strategy Selection for Query Reweighting and Query Expansion: The strategy 
selection module will choose an appropriate rank aggregation function over atomic 
conditions for overall score computation. After each feedback iteration, tuning func- 
tions (e.g., minimum weight algorithm, average weight algorithm, as in [4]), are used to 
derive the relative importance among all atomic conditions, and to update the personal 
ontology [1]. 

e. Adaptable Query Reformulation: Our system is adaptable using reweighting and 
expansion techniques. The open architecture allows us to easily add new rank aggregation 
functions, reweighting strategies, or expansion strategies. 
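
Stages a, d and e can be illustrated with a minimal Python sketch of weighted rank aggregation followed by feedback-driven reweighting. The sketch does not implement the minimum weight or average weight algorithms of [4]; the weighted-average aggregation, the update rule, the rate parameter and all names are assumptions chosen for illustration.

```python
def aggregate(scores: dict, weights: dict) -> float:
    """Overall score: weighted average of the per-condition scores."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(w * scores.get(c, 0.0) for c, w in weights.items()) / total


def _mean_score(results: list, condition: str) -> float:
    if not results:
        return 0.0
    return sum(r.get(condition, 0.0) for r in results) / len(results)


def reweight(weights: dict, relevant: list, non_relevant: list, rate: float = 0.5) -> dict:
    """Raise the weight of conditions that score higher on relevant results
    than on non-relevant ones, lower the others, then normalize to sum 1."""
    raw = {}
    for condition, weight in weights.items():
        shift = _mean_score(relevant, condition) - _mean_score(non_relevant, condition)
        raw[condition] = max(0.0, weight + rate * shift)
    total = sum(raw.values())
    if total == 0:
        return dict(weights)
    return {c: v / total for c, v in raw.items()}


# Hypothetical feedback: the user liked a result that matched on content,
# and rejected one that matched mainly on path structure.
initial_weights = {"content": 0.5, "path": 0.5}
liked = {"content": 0.9, "path": 0.2}
rejected = {"content": 0.3, "path": 0.8}

new_weights = reweight(initial_weights, [liked], [rejected])
score_before = aggregate(liked, initial_weights)
score_after = aggregate(liked, new_weights)
```

After one feedback iteration the content condition dominates, so results resembling the liked one move up the ranking while results resembling the rejected one move down.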

3 Demonstration 

The INEX 2003 benchmark [3] consists of a set of content-and-structure queries and 
content-only queries over 12117 journal articles. Each document in a result set of a 
query is assigned a relevance assessment score provided by human experts. We run our 
method on this data set to show the improvement of average precision and recall using 
relevance feedback with up to four iterations. Our baseline uses only ontology-based 
expansion [2]. We show the comparison between different strategies of rank aggrega- 
tion, query reweighting and expansion. We also show our approach to refine structural 
XML queries based on relevance feedback. 

1 Introduction 

How to model spatiotemporal changes is one of the key issues in research on 
spatiotemporal databases. Due to the inefficiency of previous spatiotemporal data 
models [1,2], none of them has been widely accepted so far. 

This paper investigates the types of spatiotemporal changes and the approach to 
describing spatiotemporal changes. The semantics of spatiotemporal changes are 
studied and a systematic classification of spatiotemporal changes is proposed, based 
on which a framework for a spatiotemporal semantic model is presented. 



2 Semantic Modeling of Spatiotemporal Changes 

The framework for modeling spatiotemporal changes is shown in Fig. 1 as an And/Or 
Tree. Spatiotemporal changes are represented by object-level spatiotemporal changes 
that result in changes of object identities and attribute-level spatiotemporal changes 
that do not change any objects’ identities but only the internal attributes of an object. 
Attribute-level spatiotemporal changes are spatial attribute changes or thematic at- 
tribute changes, which are described by spatial descriptor and attribute descriptor, 
while object-level spatiotemporal changes are discrete identity changes, which are 
represented by history topology. The modeling of spatiotemporal changes as shown 
in Fig. 1 is complete. The proof can be found in [3]. 



Spatiotemporal Changes 
    Object-level Spatiotemporal Changes 
        History Topology 
            Discrete Identity Changes 
    Attribute-level Spatiotemporal Changes 
        Spatial Descriptor 
            Continuous Spatial Changes 
            Discrete Spatial Changes 
        Attribute Descriptor 
            Continuous Attribute Changes 
            Discrete Attribute Changes 

Fig. 1. The framework for modeling spatiotemporal changes 

3 A Framework of Spatiotemporal Semantic Model 

The framework of the spatiotemporal semantic model is shown in Fig. 2. The circle 
notation represents identity-level changes, and the triangle notation represents 
attribute-level changes. The attribute descriptor describes the time-varying thematic properties 
of the spatiotemporal object. The spatial descriptor represents the time-varying 
spatial value of the spatiotemporal object. The history topology, which represents 
identity-level changes, describes the life cycle of spatiotemporal objects, such as splitting 
and merging. Thus a spatiotemporal object can be defined as a quadruple of object 
identity, spatial descriptor, attribute descriptor and history topology, that is, O = 
<OID, SD, AD, HT>. This structure can represent both spatiotemporal data and 
spatiotemporal changes: a static state of a spatiotemporal object can be determined by 
inputting a definite time value into SD, AD and HT, while a dynamic state during a 
period of time can be obtained from SD, AD and HT over that period. 
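
The quadruple can be sketched as a Python data structure in which SD and AD are functions of time and HT is a list of dated life-cycle events. The class, its method names, the event tuples and the sample parcel are hypothetical; the sketch only illustrates how a static state follows from a single time value and a dynamic state from a period.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class SpatiotemporalObject:
    oid: str                              # OID: object identity
    sd: Callable[[float], Any]            # SD: time -> spatial value
    ad: Callable[[float], dict]           # AD: time -> thematic attributes
    ht: List[Tuple[float, str, list]]     # HT: (time, event, related OIDs)

    def state_at(self, t: float) -> dict:
        """Static state: evaluate SD, AD and HT at one definite time value."""
        return {
            "oid": self.oid,
            "spatial": self.sd(t),
            "attributes": self.ad(t),
            "history": [event for event in self.ht if event[0] <= t],
        }

    def states_during(self, t_start: float, t_end: float, step: float) -> list:
        """Dynamic state: sample the static states over a period of time."""
        states = []
        t = t_start
        while t <= t_end:
            states.append(self.state_at(t))
            t += step
        return states


# Hypothetical land parcel: its position changes continuously, its owner
# changes discretely at t = 10, and it is split into two parcels at t = 12.
parcel = SpatiotemporalObject(
    oid="p1",
    sd=lambda t: (2.0 * t, 0.0),
    ad=lambda t: {"owner": "Ann" if t < 10 else "Bob"},
    ht=[(0.0, "created", []), (12.0, "split", ["p2", "p3"])],
)

snapshot = parcel.state_at(5.0)
trajectory = parcel.states_during(0.0, 12.0, 6.0)
```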




Fig. 2. The spatiotemporal semantic model 

1 Introduction 

A closer integration of XML and database systems is actively pursued by re- 
searchers and vendors because of the many practical benefits it offers. Additional 
special benefits can be achieved on temporal information management - an im- 
portant application area that represents an unsolved challenge for relational 
databases [1]. Indeed, the XML data model and query languages support: 

— Temporally grouped representations that have long been recognized as a 
natural data model for historical information [2], and 

— Turing-complete query languages, such as XQuery [3], where all the con- 
structs needed for temporal queries can be introduced as user-defined li- 
braries, without requiring extensions to existing standards. 

By contrast, the flat relational tables of traditional DBMSs are not well-suited 
for temporally grouped representations [4]; moreover, significant extensions are 
required to support temporal information in SQL and, in the past, they were 
poorly received by SQL standard committees. 

We will show that (i) XML hierarchical structure can naturally represent the 
history of databases and XML documents via temporally-grouped data models, 
and (ii) powerful temporal queries can be expressed in XQuery without requir- 
ing any extension to current standards. This approach is quite general and, in 
addition to the evolution history of databases, it can be used to support the 
version history of XML documents for transaction-time, valid-time, and 
bitemporal chronicles [5]. We will demo the queries discussed in [5] and show that this 
approach leads to simple programming environments that are fully-integrated 
with current XML tools and commercial DBMSs. 

2 The Systems ArchIS and ICAP 

In our demo, we first show that transaction-time history of relational databases 
can be effectively published as XML views, where complex temporal queries 
on the evolution of database relations can be expressed in standard XQuery 
[6]. Therefore, we will demonstrate our ArchIS prototype that supports these 
queries efficiently on traditional database systems enhanced with SQL/XML [7]. 
A temporal library of XQuery functions is used to facilitate the writing of the 
more complex queries and hide some implementation details (e.g., the internal 
representation of ‘now’). We can thus support the complete gamut of historical 
queries, including snapshot and time-slicing queries, element-history queries, 
and since and until queries. These temporal queries in XQuery are then mapped to 
equivalent SQL/XML queries that are executed on the RDBMS. 
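
A temporally grouped representation and a snapshot query over it can be sketched in Python with the standard xml.etree.ElementTree module. This is not the ArchIS implementation and not XQuery: the element names, the tstart and tend attributes, the sample data and the use of 9999-12-31 to stand for 'now' are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NOW = "9999-12-31"   # assumed internal representation of 'now'

# Each attribute value of an employee is grouped under the employee element
# and carries its own validity period.
HISTORY = """
<employees>
  <employee tstart="1995-01-01" tend="9999-12-31">
    <name tstart="1995-01-01" tend="9999-12-31">Bob</name>
    <salary tstart="1995-01-01" tend="1995-05-31">60000</salary>
    <salary tstart="1995-06-01" tend="9999-12-31">70000</salary>
  </employee>
</employees>
""".strip()


def valid_at(element: ET.Element, date: str) -> bool:
    """ISO dates compare correctly as strings, so no parsing is needed."""
    return element.get("tstart") <= date <= element.get("tend")


def snapshot(root: ET.Element, date: str) -> list:
    """Return the state of every employee as of the given date."""
    rows = []
    for employee in root.findall("employee"):
        if not valid_at(employee, date):
            continue
        rows.append({child.tag: child.text
                     for child in employee if valid_at(child, date)})
    return rows


root = ET.fromstring(HISTORY)
early = snapshot(root, "1995-03-15")
later = snapshot(root, "1996-01-01")
```

A helper such as valid_at plays the role that the text assigns to the temporal function library: it hides how periods and 'now' are encoded.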

The next topic in the demo is the application of our temporal representa- 
tions and queries to XML documents of arbitrary nesting complexity. In the 
ICAP project [8], we store the version history of documents of public inter- 
est in ways that assure that powerful historical queries can be easily expressed 
and supported. Examples include successive versions of standards and normative 
documents, such as the UCLA course catalog [9], and the W3C XLink specs [10], 
which are issued in XML form. Toward this objective, 

(i) we use structured diff algorithms [11-14] to compute the validity periods 
of the elements in the multi-version document, 

(ii) we use the output generated by the diff algorithm to build a concise 
representation of the history of the document using a temporally grouped data model. 
Then, on this representation, 

(iii) we use XQuery, enhanced with the library of temporal functions, to 
formulate temporal queries on the evolution of the document and its content. 
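
Steps (i) and (ii) can be illustrated with a deliberately simplified Python sketch. It replaces the structured diff algorithms of [11-14] with a comparison of flat dictionaries keyed by a hypothetical element identifier, and it assumes half-open validity periods in which tend is the date of the version that changed or removed the element.

```python
NOW = "9999-12-31"   # assumed marker for elements that are still current


def build_history(versions: list) -> list:
    """versions: list of (date, {element_id: text}), sorted by date.

    Returns one record per distinct element value with its validity period.
    """
    history = []
    open_records = {}   # element_id -> index of its current record in history
    for date, content in versions:
        # Close the period of every element that changed or disappeared.
        for element_id, index in list(open_records.items()):
            if content.get(element_id) != history[index]["text"]:
                history[index]["tend"] = date
                del open_records[element_id]
        # Open a new period for every new or changed element.
        for element_id, text in content.items():
            if element_id not in open_records:
                history.append({"id": element_id, "text": text,
                                "tstart": date, "tend": NOW})
                open_records[element_id] = len(history) - 1
    return history


# Three hypothetical versions of a two-section document.
versions = [
    ("2001-01-01", {"s1": "Scope", "s2": "Draft rules"}),
    ("2002-01-01", {"s1": "Scope", "s2": "Final rules"}),
    ("2003-01-01", {"s1": "Scope"}),
]

periods = [(r["id"], r["text"], r["tstart"], r["tend"])
           for r in build_history(versions)]
```

Unchanged elements are stored once with a long validity period, which is what keeps the resulting history concise.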

The ICAP system also provides additional version-support services, 
including the ability to color-mark changes between versions and to annotate the 
changes with explanations and useful meta-information. 

1 Overview of the SVMgr Tool 

The SVMgr tool is an integrated development environment for the management 
of a relational database supporting schema versioning, based on the multi-pool 
implementation solution [2]. In a few words, the multi-pool solution allows the 
extensional data connected to each schema version (data pool) to evolve inde- 
pendently from each other. The multi-pool solution is more flexible and poten- 
tially useful for advanced applications as it allows the coexistence of different 
full-fledged conceptual viewpoints on the mini-world modeled by the database 
[5], and it has partially been adopted also by other authors [3]. The multi-pool 
implementation underlying the SVMgr tool is based on the Logical Storage 
Model presented in [4] and allows the underlying multi-version database to be 
implemented on top of MS Access. The software prototype has been written in 
Java (it is downward compatible with the 1.2 version) and interacts with the 
underlying database via JDBC/ODBC on a MS Windows platform. 

In order to show the features of the multi-pool approach in practice and test 
its potential against applications, SVMgr has been equipped with a multi-schema 
query interface, initially supporting select-project-join queries written in the 
Multi-Schema Query Language MSQL [4,5]. Hence, the SVMgr prototype 
represents the first implemented relational database system with schema 
versioning support which is able to answer multi-schema queries. The MSQL language 
includes two syntax extensions to refer to different schema versions: naming 
qualifiers and extensional qualifiers [4,5]. The former allow users to denote a 
schema object (e.g. an attribute or relation name) through its name used in dif- 
ferent schema versions: for instance, “[SV1:R]” denotes the relation named R in 
schema version SV1. The latter allow users to denote object values as stored in 
different data pools: for instance “SV2:S” denotes the instance of relation S in 
the data pool connected to schema version SV2. Recently, the MSQL language 
has been further developed with the addition of grouping and ordering facilities: 
multi-schema extensions of the SQL GROUP BY, HAVING and ORDER BY clauses, 
and of the SQL aggregate functions are fully supported by the SVMgr tool in its 
current release. 
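
To make the two kinds of qualifier concrete, the following Python sketch 
classifies tokens written in the two forms quoted above. It is only an 
illustration: the actual MSQL grammar is defined in [4,5], and the regular 
expressions and the function name classify are assumptions covering just the 
examples "[SV1:R]" and "SV2:S". 

```python
import re

# Patterns inferred from the two examples in the text only (assumption).
NAMING = re.compile(r"^\[(\w+):(\w+)\]$")    # e.g. [SV1:R]
EXTENSIONAL = re.compile(r"^(\w+):(\w+)$")   # e.g. SV2:S


def classify(token):
    """Return (kind, schema_version, name) for a possibly qualified token."""
    m = NAMING.match(token)
    if m:
        # The object is denoted by the name it has in the given schema version.
        return ("naming", m.group(1), m.group(2))
    m = EXTENSIONAL.match(token)
    if m:
        # Values are taken from the data pool of the given schema version.
        return ("extensional", m.group(1), m.group(2))
    return ("plain", None, token)


examples = [classify(t) for t in ("[SV1:R]", "SV2:S", "R")]
```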

The SVMgr environment, in its current release (ver 13.02, as of July 2004), 
supports four main groups of functions: 
Database Content Management, which allows users to inspect and modify 
the contents of the data pools associated with different schema versions. 
Integrity checks on primary key uniqueness (which may be threatened by the 
execution of schema changes) can also be performed. 

Schema Version Management, which allows users to apply schema changes 
and create a new schema version. Supported schema changes are: add, re- 
name, drop a table; add, rename, drop a table column; change the primary 
key of a table. 

Integration Support Tools, which support users in the integration 
activities [1] when the underlying database is used as a data source in a 
heterogeneous environment. 

Multi-schema Queries, which allows users to execute multi-schema SPJ 
queries, by implementing an MSQL query language interface. MSQL queries 
are translated into standard SQL queries which are executed via JDBC on 
the underlying database implementing the Logical Storage Model. 
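
The primary key uniqueness check mentioned under Database Content Management 
can be sketched as a GROUP BY / HAVING query. The Python example below is a 
hedged illustration only: it uses an in-memory SQLite database instead of 
MS Access, and the table employee, its columns, and the helper duplicate_keys 
are hypothetical names, not part of SVMgr. 

```python
import sqlite3


def duplicate_keys(conn, table, key_columns):
    """Return the key values that occur more than once in `table`.

    Identifiers are interpolated into the SQL text, so they must come
    from a trusted catalog, never from end-user input.
    """
    cols = ", ".join(key_columns)
    sql = (f"SELECT {cols}, COUNT(*) FROM {table} "
           f"GROUP BY {cols} HAVING COUNT(*) > 1")
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER, dept TEXT, name TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "A", "Ann"), (2, "A", "Bob"), (3, "B", "Ann")])

# Keeping (id) as the key is safe; switching the key to (name) is not.
violations_by_id = duplicate_keys(conn, "employee", ["id"])
violations_by_name = duplicate_keys(conn, "employee", ["name"])
```

A "change the primary key" operation could run such a check first and refuse 
the change whenever the result is non-empty. 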

Users of the tool are expected to be database administrators, who have 
complete control and responsibility over the database schema and contents, 
including the management of schema versions. Although SVMgr users are expected 
to have a reasonable knowledge of the main features of schema versioning with 
the multi-pool implementation solution, the prototype has been equipped with 
a user-friendly interface and requires only minimal knowledge of the underlying 
data model beyond intuition. In addition, several correctness checks have been 
encoded in all the available user functions to protect, as far as possible, 
the database integrity from incorrect use. 



References 

1. S. Bergamaschi, S. Castano, A. Ferrara, F. Grandi, F. Guerra, G. Ornetti, 
M. Vincini, Description of the Methodology for the Integration of Strongly Het- 
erogeneous Sources, Tech. Rep. D1.R6, D2I Project, 2002, 
http://www.dis.uniromal.it/~lembo/D2I/Prodotti/index.html. 

2. C. De Castro, F. Grandi, and M. R. Scalas. Schema Versioning for Multitemporal 
Relational Databases. Information Systems, 22(5):249-290, 1997. 

3. R. de Matos Galante, A. Bueno da Silva Roma, A. Jantsch, N. Edelweiss, and 
C. Saraiva dos Santos. Dynamic Schema Evolution Management Using Version 
in Temporal Object-Oriented Databases. In Proceedings of the 13th International 
Conference on Database and Expert Systems Applications (DEXA 2002), pages 
524-533, Aix-en-Provence, France, September 2002. Springer Verlag. 

4. F. Grandi, “A Relational Multi-Schema Data Model and Query Language for Full 
Support of Schema Versioning”, Proc. of SEBD 2002, Portoferraio - Isola d’Elba, 
Italy, pp. 323-336, 2002. 

5. F. Grandi, “Boosting the Schema Versioning Potentialities: Querying with MSQL 
in the Multi-Pool Approach”, 2004 (in preparation). 




GENNERE: A Generic Epidemiological Network 
for Nephrology and Rheumatology 

Ana Simonet 1, Michel Simonet 1, Cyr-Gabin Bassolet 1, Sylvain Ferriol 1, 
Cedric Gueydan 1, Remi Patriarche 1, Haijin Yu 2, Ping Hao 2, Yi Liu 2, Wen Zhang 2, 
Nan Chen 2, Michel Foret 5, Philippe Gaudin 4, Georges De Moor 6, Geert Thienpont 6, 
Mohamed Ben Said 3, Paul Landais 3, and Didier Guillon 5 

1 Laboratoire TIMC-IMAG, Faculte de Medecine de Grenoble, France 
{Ana.Simonet, Michel.Simonet}@imag.fr 

2 Rui Jin Hospital, Shanghai, China 
3 LBIM, Universite Paris 5 Necker, France 
4 CHU Grenoble, France 
5 AGDUC Grenoble, France 
6 RAMIT, Belgium 

Abstract. GENNERE is a networked information system designed to answer 
epidemiological needs. Based on a French experiment in the field of End-Stage 
Renal Diseases (ESRD), it has been designed to fit Chinese medical needs and 
administrative rules. It has been implemented for nephrology and rheumatology 
at the Rui Jin hospital in Shanghai, but its design and implementation have 
been guided by genericity, in order to ease its adaptation and extension to 
other diseases and other countries. Genericity has been considered at the 
levels of event design, database design and production, and software design. 
This first experiment in China leads to some conclusions about the adaptability 
of the system to several diseases and about multilingualism in the interface 
and in medical terminologies. 

1 Introduction 

GENNERE, for Generic Epidemiological Network for Nephrology and Rheumatology, 
is a project supported by the European ASIA-ITC program in 2003-2004. This 
project comes from a cooperation established between partners of the French 
MSIS-REIN project [1] and the Rui Jin hospital in Shanghai. The main goal of 
the project is to set up a system for the epidemiological follow-up of chronic 
diseases, following the MSIS-REIN approach but adapted to China’s needs [2]. A 
secondary objective was to find a methodology for designing and implementing 
software that can easily be adapted to other chronic diseases and to other 
countries. To demonstrate the generic aspects which have been emphasized in the 
GENNERE program, two medical fields have been considered in this experiment: 
nephrology (as in MSIS-REIN) and rheumatology. In this demonstration we present 
the GENNERE systems, with an emphasis on genericity aspects in their design and 
implementation. 

2 The GENNERE Project 

Genericity, which is the major non-medical objective of the GENNERE project, has 
been considered mainly at two levels: the design process and the software 
implementation. From the design point of view, the methodology followed was: 
1) to highlight the events and functionalities by categories of users, 2) to 
define the abstract ontological model for chronic diseases, 3) to isolate the 
data specific to a country, 4) to design the interface blocks. 

Events. In GENNERE the events for nephrology are the same as the events taken 
into account in the French MSIS-REIN project. The events model for rheumatology 
is very similar to that for nephrology, which demonstrates that a core of 
minimal events can be defined for this kind of system. 

Database design. For database design we have used the CASE tool ISIS (Informa- 
tion System Initial Specification) developed at the TIMC laboratory [3]. It enables the 
designer to work at the conceptual level and specify behavioral aspects rather than 
implement them directly at the level of relational tables. Moreover, thanks to its abil- 
ity to check the consistency of a specification and to automatically generate the data- 
base (logical and physical schema), the ISIS system has considerably shortened the 
cycle of knowledge extraction from medical experts and its validation by users. 

Views. To minimize the work needed when the database schema is modified, we 
have been careful to access the database only through views, except for some 
database updates which were too complex to be supported by the DBMS. 
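
The benefit of view-only access can be shown with a small Python sketch. It is 
an illustration under assumptions: it uses an in-memory SQLite database, and 
the tables patient and patient_name, the view v_patient and the function 
patient_names are hypothetical, not taken from the GENNERE schema. 

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient ("
             "id INTEGER PRIMARY KEY, family_name TEXT, given_name TEXT)")
conn.execute("INSERT INTO patient VALUES (1, 'Li', 'Wei')")
conn.execute("CREATE VIEW v_patient AS "
             "SELECT id, family_name || ' ' || given_name AS full_name "
             "FROM patient")


def patient_names(conn):
    """Application code: reads only from the view, never from base tables."""
    return conn.execute("SELECT id, full_name FROM v_patient").fetchall()


before = patient_names(conn)

# Schema change: names move to a separate table. Only the view is redefined;
# patient_names() is left untouched.
conn.execute("CREATE TABLE patient_name (patient_id INTEGER, full_name TEXT)")
conn.execute("INSERT INTO patient_name "
             "SELECT id, family_name || ' ' || given_name FROM patient")
conn.execute("DROP VIEW v_patient")
conn.execute("CREATE VIEW v_patient AS "
             "SELECT patient_id AS id, full_name FROM patient_name")

after = patient_names(conn)
```

The application function returns the same rows before and after the change, 
because the schema modification is absorbed by the view definition. 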

Ontological data model. As in the French model, the core of the conceptual model is 
centered on three generic concepts: PATIENT, FOLLOW UP and TREATMENT. However, 
each concept is specialized with a structure specific to each disease. For 
example, the TREATMENT concept is much more complex for Rheumatoid Arthritis than 
for ESRD because this illness has several anatomic localizations and can be 
treated by several means: medicines, local manipulations or traditional Chinese 
medicine. 

Country-specific data. Setting up the GENNERE system has brought to light the 
data that are specific to a given country, e.g., patient addresses and 
insurance, which depend strongly on the geographical and administrative 
organization of the country; another example is traditional Chinese medicine 
(acupuncture, herbs, massages, etc.). These categories of data are highly 
country-specific and will have to be studied anew whenever a new country is 
considered. 

Multilingualism. Dealing simultaneously with several languages, while keeping the 
possibility of adding a new language, imposes the choice of an encoding system 
which supports a wide variety of languages [4]. In GENNERE, which must support 
Chinese, English and French, the choice was UTF-8 (Unicode), which also supports 
most known languages, including other Asian languages. 
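
As a small illustration of this choice, the Python sketch below encodes one 
sample term in the three languages with the same UTF-8 codec and decodes it 
back without loss. The sample strings are assumptions chosen for the example 
and are not taken from the GENNERE terminology. 

```python
# Sample strings only; not actual GENNERE labels.
labels = {"en": "nephrology", "fr": "néphrologie", "zh": "肾脏病学"}

# One encoding handles all three scripts.
encoded = {lang: text.encode("utf-8") for lang, text in labels.items()}

# UTF-8 is variable-length: ASCII letters take 1 byte each, accented Latin
# letters take more than one byte, and these Chinese characters take 3 each.
byte_lengths = {lang: len(data) for lang, data in encoded.items()}

# Decoding restores the original text exactly.
decoded = {lang: data.decode("utf-8") for lang, data in encoded.items()}
```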

Metadata. To present multilingual interfaces without needing to duplicate the 
interface code, we built a specific database for the management of the metadata 
of the GENNERE system (Nephrology and Rheumatology). This database contains the 
description of the database objects: tables, columns, domain values (for 
enumerated attributes). It also contains the objects needed by the various 
interfaces, e.g., labels used in the Graphical User Interface. Each object has 
a unique identifier and is associated with as many items as there are languages 
(Chinese, English and French at present). Thus, a value or a label can be 
rapidly retrieved according to the selected language. This method also 
facilitates the addition of a new language: one only has to fill the metadata 
tables with the corresponding items in this language. 
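
A metadata store of this kind can be sketched in Python as follows. This is a 
hedged illustration, not the GENNERE schema: the tables meta_object and 
meta_item, their columns, the function label and the sample labels are all 
assumptions, and an in-memory SQLite database stands in for the real DBMS. 

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE meta_object (
    object_id INTEGER PRIMARY KEY,
    kind      TEXT            -- e.g. table, column, domain value, label
);
CREATE TABLE meta_item (
    object_id INTEGER REFERENCES meta_object(object_id),
    lang      TEXT,
    text      TEXT,
    PRIMARY KEY (object_id, lang)   -- one item per object and language
);
""")
conn.execute("INSERT INTO meta_object VALUES (1, 'label')")
conn.executemany("INSERT INTO meta_item VALUES (?, ?, ?)",
                 [(1, "en", "Treatment"),
                  (1, "fr", "Traitement"),
                  (1, "zh", "治疗")])


def label(conn, object_id, lang):
    """Return the text of an object in the selected language, or None."""
    row = conn.execute(
        "SELECT text FROM meta_item WHERE object_id = ? AND lang = ?",
        (object_id, lang)).fetchone()
    return row[0] if row else None


# Adding a language requires only new rows, not new interface code.
conn.execute("INSERT INTO meta_item VALUES (1, 'de', 'Behandlung')")
```

The interface code calls label with the identifier of the object and the 
language selected by the user, so the same screen definition serves every 
language present in the metadata tables. 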



3 Conclusion 

Genericity has constituted a major concern in the design and the implementation of 
the GENNERE information system. 

This project showed us the need for tools that support cooperative work, 
especially when the team in charge of conceiving and developing the project is 
culturally, socially and geographically heterogeneous. This led to several 
cycles of knowledge extraction and validation in order to reach a consensus. 
The CASE tool ISIS [3] played a very important role in reaching consensus more 
rapidly among the partners. 

The human factor has been decisive for the success of MSIS-REIN. Besides 
technological improvements, didactic efforts have been made and are still 
necessary to help users and to identify impediments to change. This factor is 
at least as important in China. 

The GENNERE program, in its final development phase, will be installed at the 
Rui Jin hospital in Shanghai at the end of 2004 on two servers, one for 
nephrology and one for rheumatology. During the next phase (2005-2006), data 
from other centres will be integrated and a data warehouse will be implemented, 
also in a generic way, along with epidemiological and data presentation tools 
[5]. A Geographical Information System, currently under implementation in 
France, seems very promising for supporting public health decisions and will be 
considered in a later phase [6]. 
Abstract. Web services are evolving as the paradigm for loosely coupled 
architectures. The prospect of automatically composing complex processes from 
simple services is promising. However, a number of open issues remain: Which 
aspects of service semantics need to be explicated? Does it suffice to just 
model data structures and interfaces, or do we also need process descriptions, 
behavioral semantics, and quality of service specifications? How can we deal 
with heterogeneous service descriptions? Should we use shared ontologies or 
ad hoc mappings? This panel will discuss the extent to which established 
techniques from conceptual modelling can help in describing services to enable 
their discovery, selection, composition, negotiation, and invocation. 